---

language: en
datasets:
- librispeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
- librispeech
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer-Large-En-Libri-960h
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-clean
      type: librispeech
      args: en
    metrics:
       - name: Test WER
         type: wer
         value: 2.69
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-other
      type: librispeech
      args: en
    metrics:
       - name: Test WER
         type: wer
         value: 6.89
---


# **ChunkFormer-Large-En-Libri-960h: Pretrained ChunkFormer-Large on 960 hours of LibriSpeech dataset**

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](paper.pdf)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-En-Libri-960h** is an English Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was trained on the full 960 hours of LibriSpeech, a widely used benchmark corpus for ASR research.

---
<a name = "implementation" ></a>
## Documentation and Implementation
The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure a fair comparison, all models are trained exclusively with the [**WENET**](https://github.com/wenet-e2e/wenet) framework.

| No. | Model                   | Test-Clean | Test-Other | Avg. |
|-----|-------------------------|------------|------------|------|
|  1  | **ChunkFormer**         | 2.69       | 6.89       | 4.79 |
|  2  | **Efficient Conformer** | 2.71       | 6.95       | 4.83 |
|  3  | **Conformer**           | 2.77       | 6.93       | 4.85 |
|  4  | **Squeezeformer**       | 2.87       | 7.16       | 5.02 |
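
For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not the WENET scoring script used for the table above; the function name `wer` is our own.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length.

    Minimal illustration only; assumes a non-empty reference and
    whitespace tokenization (no text normalization).
    """
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(r)][len(h)] / len(r)
```

For example, one substitution in a four-word reference gives a WER of 0.25 (i.e., 25%).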



---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h
```
Either command downloads the model checkpoint into a `chunkformer-large-en-libri-960h` folder inside your `chunkformer` directory.

3. **Run the model**
```bash
# --max_duration is given in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --max_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
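To build intuition for the `--chunk_size`, `--left_context_size`, and `--right_context_size` flags: ChunkFormer processes long audio in fixed-size chunks, each attending to a bounded window of left and right context. The sketch below only illustrates how such chunk boundaries could be laid out over a frame sequence; it is our own simplified illustration (`chunk_spans` is a hypothetical helper), not the model's actual masking implementation.

```python
def chunk_spans(num_frames: int, chunk_size: int,
                left_context: int, right_context: int):
    """Yield (ctx_start, start, end, ctx_end) for each chunk.

    [start, end) is the span the chunk is responsible for;
    [ctx_start, ctx_end) is the wider window it may attend to,
    clipped at the sequence boundaries. Illustrative sketch only.
    """
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        ctx_start = max(0, start - left_context)      # left context, clipped
        ctx_end = min(num_frames, end + right_context)  # right context, clipped
        yield (ctx_start, start, end, ctx_end)
```

Larger context sizes widen each chunk's attention window (better accuracy, more compute); the chunk size bounds memory use independently of total audio length, which is what enables long-form transcription.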
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)


---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- [email protected]
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)