ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

License: CC BY-NC 4.0


Table of contents

  1. Model Description
  2. Documentation and Implementation
  3. Benchmark Results
  4. Quick Usage
  5. Citation
  6. Contact

Model Description

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on approximately 3000 hours of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found HERE.

Note: only the train subset of each dataset was used for tuning the model.


Documentation and Implementation

The Documentation and Implementation of ChunkFormer are publicly available.


Benchmark Results

We evaluate the models using Word Error Rate (WER, in %; lower is better). To keep the comparison consistent and fair, we manually apply text normalization before scoring, covering numbers, uppercase letters, and punctuation.
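
The evaluation described above can be sketched as follows. This is a minimal illustration, not the scoring script used for the reported numbers: the function names `normalize_text` and `word_error_rate` are hypothetical, the normalization only lowercases and strips punctuation, and number handling (for example expanding digits into words) is left out because the source does not specify how it was done.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    # NFC keeps Vietnamese letters with diacritics as single code points.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the reference length."""
    ref = normalize_text(reference).split()
    hyp = normalize_text(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Multiply the returned value by 100 to express it in the same units as the tables in this section.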

  1. Public Models:

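    | No. | Model                                        | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg.  |
    |-----|----------------------------------------------|---------|-------|--------------|---------------|-------|
    | 1   | ChunkFormer                                  | 110M    | 4.18  | 6.66         | 14.09         | 8.31  |
    | 2   | vinai/PhoWhisper-large                       | 1.55B   | 4.67  | 8.14         | 13.75         | 8.85  |
    | 3   | nguyenvulebinh/wav2vec2-base-vietnamese-250h | 95M     | 10.77 | 18.34        | 13.33         | 14.15 |
    | 4   | openai/whisper-large-v3                      | 1.55B   | 8.81  | 15.45        | 20.41         | 14.89 |
    | 5   | khanhld/wav2vec2-base-vietnamese-160h        | 95M     | 15.05 | 10.78        | 31.62         | 19.16 |
    | 6   | homebrewltd/Ichigo-whisper-v0.1              | 22M     | 13.46 | 23.52        | 21.64         | 19.54 |
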
  2. Private Models (API):

    | No. | Model       | VLSP - Task 1 |
    |-----|-------------|---------------|
    | 1   | ChunkFormer | 14.1          |
    | 2   | Viettel     | 14.5          |
    | 3   | Google      | 19.5          |
    | 4   | FPT         | 28.8          |
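
The Avg. column in the public-model table appears to be the unweighted mean of the three test-set WERs. The sketch below recomputes it from the values printed in the table; the helper name `macro_average` is hypothetical. Because the table shows rounded scores, a recomputed average can differ from the reported one in the last digit.

```python
from statistics import mean

# (Vivos, Common Voice, VLSP - Task 1) WERs and the reported Avg.,
# copied from the public-model table.
ROWS = {
    "ChunkFormer": ((4.18, 6.66, 14.09), 8.31),
    "vinai/PhoWhisper-large": ((4.67, 8.14, 13.75), 8.85),
    "nguyenvulebinh/wav2vec2-base-vietnamese-250h": ((10.77, 18.34, 13.33), 14.15),
    "openai/whisper-large-v3": ((8.81, 15.45, 20.41), 14.89),
    "khanhld/wav2vec2-base-vietnamese-160h": ((15.05, 10.78, 31.62), 19.16),
    "homebrewltd/Ichigo-whisper-v0.1": ((13.46, 23.52, 21.64), 19.54),
}


def macro_average(scores):
    """Unweighted mean over test sets, rounded to two decimals."""
    return round(mean(scores), 2)


for name, (scores, reported) in ROWS.items():
    print(f"{name}: recomputed {macro_average(scores):.2f}, reported {reported:.2f}")
```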

Quick Usage

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

  1. Download the ChunkFormer Repository
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt   
  2. Download the Model Checkpoint from Hugging Face
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"

or

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie

Either command places the model checkpoint in a chunkformer-large-vie folder inside the directory you ran it from (the chunkformer directory, if you followed the previous step).

  3. Run the model
# --max_duration is given in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --max_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
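
The download and decode steps above can also be driven from Python. The following is a rough sketch under a few assumptions: it uses `snapshot_download` from `huggingface_hub` as an alternative to the CLI download, it passes only the `decode.py` flags shown in this section, and the helper names (`download_checkpoint`, `build_decode_command`, `transcribe`) are hypothetical and not part of the ChunkFormer repository.

```python
import subprocess
import sys

REPO_ID = "khanhld/chunkformer-large-vie"


def download_checkpoint(local_dir: str = "./chunkformer-large-vie") -> str:
    """Fetch the checkpoint from Hugging Face and return its local path."""
    # Imported lazily so the command builder works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def build_decode_command(
    checkpoint_dir: str,
    audio_path: str,
    max_duration: int = 14400,  # seconds; decode.py defaults to 1800
    chunk_size: int = 64,
    left_context_size: int = 128,
    right_context_size: int = 128,
) -> list:
    """Build the argument list for decode.py using the flags shown above."""
    return [
        sys.executable, "decode.py",
        "--model_checkpoint", str(checkpoint_dir),
        "--long_form_audio", str(audio_path),
        "--max_duration", str(max_duration),
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context_size),
        "--right_context_size", str(right_context_size),
    ]


def transcribe(audio_path: str, repo_dir: str = ".") -> None:
    """Download the checkpoint, then run decode.py from the cloned repo."""
    checkpoint_dir = download_checkpoint()
    command = build_decode_command(checkpoint_dir, audio_path)
    # repo_dir is assumed to be the root of the cloned chunkformer repository.
    subprocess.run(command, cwd=repo_dir, check=True)
```

Calling `transcribe("path/to/audio.wav")` from the root of the cloned repository should be equivalent to running steps 2 and 3 by hand. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.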

Advanced Usage can be found HERE


Citation

If you use this work in your research, please cite:

@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}

Contact
