ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition

Model Description
Documentation and Implementation
Benchmark Results
Quick Usage
Citation
Contact

Model Description

ChunkFormer-Large-Vie is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on approximately 3,000 hours of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found HERE.

Note: only the train subset of these datasets was used for tuning the model.

Documentation and Implementation

The documentation and implementation of ChunkFormer are publicly available; the source code is hosted at https://github.com/khanld/chunkformer.

Benchmark Results

We evaluate the models using Word Error Rate (WER); lower is better. To keep the comparison consistent and fair, we manually apply text normalization before scoring, covering numbers, uppercase letters, and punctuation.
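
As a rough illustration of the metric, the sketch below computes WER as the word-level edit distance divided by the number of reference words, after a simplified normalization. The `normalize` and `wer` helpers are hypothetical names introduced here, and they do not reproduce the normalization used for the benchmark: only lowercasing and ASCII punctuation stripping are shown, and number handling is left out.

```python
import string


def normalize(text: str) -> list:
    """Simplified normalization (assumption): lowercase and drop ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

The tables below report WER as a percentage, so a value returned by this sketch would be multiplied by 100 for comparison.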

Public Models:

No.	Model	#Params	Vivos	Common Voice	VLSP - Task 1	Avg.
1	ChunkFormer	110M	4.18	6.66	14.09	8.31
2	vinai/PhoWhisper-large	1.55B	4.67	8.14	13.75	8.85
3	nguyenvulebinh/wav2vec2-base-vietnamese-250h	95M	10.77	18.34	13.33	14.15
4	openai/whisper-large-v3	1.55B	8.81	15.45	20.41	14.89
5	khanhld/wav2vec2-base-vietnamese-160h	95M	15.05	10.78	31.62	19.16
6	homebrewltd/Ichigo-whisper-v0.1	22M	13.46	23.52	21.64	19.54

Private Models (API):

No.	Model	VLSP - Task 1
1	ChunkFormer	14.1
2	Viettel	14.5
3	Google	19.5
4	FPT	28.8

Quick Usage

To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

Download the ChunkFormer Repository

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

Download the Model Checkpoint from Hugging Face

Option 1: use the Hugging Face CLI

pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"

Option 2: use Git LFS

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie

Either option downloads the model checkpoint to the chunkformer-large-vie folder inside your chunkformer directory.

Run the model

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
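
For use from a Python script, the command and its output can be wrapped as in the sketch below. It rests on several assumptions: `decode.py` is run from the repository root, it prints the transcript to standard output, and each line follows the timestamp format of the example output. `Segment`, `build_decode_command`, `parse_transcript`, and `transcribe` are hypothetical helpers, not part of the ChunkFormer repository. Apart from `total_batch_duration` (1800, as stated above), the default values simply mirror the example command.

```python
import re
import subprocess
import sys
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


# Matches lines such as "[00:00:01.200] - [00:00:02.400]: some text".
_LINE = re.compile(
    r"\[(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\] - "
    r"\[(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\]: (.*)"
)


def build_decode_command(checkpoint_dir, audio_path, total_batch_duration=1800,
                         chunk_size=64, left_context_size=128,
                         right_context_size=128):
    """Assemble the decode.py invocation as an argument list."""
    return [
        sys.executable, "decode.py",
        "--model_checkpoint", str(checkpoint_dir),
        "--long_form_audio", str(audio_path),
        "--total_batch_duration", str(total_batch_duration),
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context_size),
        "--right_context_size", str(right_context_size),
    ]


def _to_seconds(hours, minutes, seconds):
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_transcript(output):
    """Turn timestamped transcript lines into Segment tuples."""
    segments = []
    for line in output.splitlines():
        match = _LINE.fullmatch(line.strip())
        if match is None:
            continue  # skip log lines or anything in another format
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append(
            Segment(_to_seconds(h1, m1, s1), _to_seconds(h2, m2, s2), text)
        )
    return segments


def transcribe(checkpoint_dir, audio_path, repo_dir=".", **options):
    """Run decode.py in repo_dir and parse its stdout (assumed output channel)."""
    cmd = build_decode_command(checkpoint_dir, audio_path, **options)
    result = subprocess.run(cmd, cwd=repo_dir, check=True,
                            capture_output=True, text=True)
    return parse_transcript(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` raises an exception if decoding fails instead of silently returning an empty transcript.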

Advanced Usage can be found HERE

Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}

Contact

[email protected]
