|
--- |
|
language: en |
|
datasets: |
|
- librispeech |
|
metrics: |
|
- wer |
|
pipeline_tag: automatic-speech-recognition |
|
tags: |
|
- transcription |
|
- audio |
|
- speech |
|
- chunkformer |
|
- asr |
|
- automatic-speech-recognition |
|
- long-form transcription |
|
- librispeech |
|
license: cc-by-nc-4.0 |
|
model-index: |
|
- name: ChunkFormer-Large-En-Libri-960h |
|
results: |
|
- task: |
|
name: Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: test-clean |
|
type: librispeech |
|
args: en |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 2.69 |
|
- task: |
|
name: Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: test-other |
|
type: librispeech |
|
args: en |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 6.91 |
|
--- |
|
|
|
# **ChunkFormer-Large-En-Libri-960h: Pretrained ChunkFormer-Large on 960 hours of the LibriSpeech dataset**
|
<style> |
|
img { |
|
display: inline; |
|
} |
|
</style> |
|
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

[GitHub](https://github.com/khanld/chunkformer)

[Paper](https://arxiv.org/abs/2502.14673)

[Model Description](#description)
|
|
|
**!!! ATTENTION: Input audio must be MONO (single channel) with a 16,000 Hz sample rate**
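A quick standard-library check can catch incompatible files before decoding. The helper below is a hypothetical sketch, not part of the repository; the `ffmpeg` command in the error message is one common way to convert:

```python
# Verify that a WAV file meets the model's input requirements
# (mono, 16,000 Hz) before passing it to the decoder.
import wave


def check_wav(path):
    """Return (channels, sample_rate); raise ValueError if incompatible."""
    with wave.open(path, "rb") as f:
        channels = f.getnchannels()
        rate = f.getframerate()
    if channels != 1 or rate != 16000:
        raise ValueError(
            f"Expected mono 16 kHz audio, got {channels} channel(s) at {rate} Hz. "
            "Convert first, e.g.: ffmpeg -i in.wav -ac 1 -ar 16000 out.wav"
        )
    return channels, rate
```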
|
--- |
|
## Table of contents |
|
1. [Model Description](#description) |
|
2. [Documentation and Implementation](#implementation) |
|
3. [Benchmark Results](#benchmark) |
|
4. [Usage](#usage) |
|
5. [Citation](#citation)

6. [Contact](#contact)
|
|
|
--- |
|
<a name = "description" ></a> |
|
## Model Description |
|
**ChunkFormer-Large-En-Libri-960h** is an English Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.
|
|
|
--- |
|
<a name = "implementation" ></a> |
|
## Documentation and Implementation |
|
The documentation and implementation of ChunkFormer are publicly available in the [GitHub repository](https://github.com/khanld/chunkformer).
|
|
|
--- |
|
<a name = "benchmark" ></a> |
|
## Benchmark Results |
|
We evaluate the models using **Word Error Rate (WER)**; lower is better. To ensure a fair comparison, all models are trained exclusively with the [**WeNet**](https://github.com/wenet-e2e/wenet) framework.
|
|
|
| No. | Model | Test-Clean | Test-Other | Avg. |
|
|-----|-----------------------|------------|------------|------ | |
|
| 1 | **ChunkFormer** | 2.69 | 6.91 | 4.80 | |
|
| 2 | **Efficient Conformer** | 2.71 | 6.95 | 4.83 | |
|
| 3 | **Conformer** | 2.77 | 6.93 | 4.85 | |
|
| 4 | **Squeezeformer** | 2.87 | 7.16 | 5.02 | |
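For reference, WER is the word-level edit distance between reference and hypothesis transcripts divided by the number of reference words. A minimal illustrative implementation is sketched below; real evaluations typically use a library such as `jiwer`:

```python
# Minimal word-error-rate (WER) sketch via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```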
|
|
|
|
|
|
|
--- |
|
<a name = "usage" ></a> |
|
## Quick Usage |
|
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps: |
|
|
|
1. **Download the ChunkFormer Repository** |
|
```bash |
|
git clone https://github.com/khanld/chunkformer.git |
|
cd chunkformer |
|
pip install -r requirements.txt |
|
``` |
|
2. **Download the Model Checkpoint from Hugging Face** |
|
```bash |
|
pip install huggingface_hub |
|
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h" |
|
``` |
|
or |
|
```bash |
|
git lfs install |
|
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h |
|
``` |
|
This downloads the model checkpoint into the `chunkformer-large-en-libri-960h` folder inside your `chunkformer` directory.
|
|
|
3. **Run the model** |
|
```bash |
|
# total_batch_duration is in seconds; the default is 1800
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
|
``` |
|
Example Output: |
|
``` |
|
[00:00:01.200] - [00:00:02.400]: this is a transcription example |
|
[00:00:02.500] - [00:00:03.700]: testing the long-form audio |
|
``` |
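If you want to post-process the transcript, the timestamped lines above can be parsed into `(start, end, text)` tuples. This is an illustrative sketch that assumes the exact output format shown in the example:

```python
# Parse decoder output lines of the form
# "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" into (start, end, text) tuples,
# with start/end in seconds.
import re

LINE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\] - "
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]: (.*)"
)


def parse_segments(lines):
    segments = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip non-segment lines (logs, blank lines, ...)
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups()[:8])
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        segments.append((start, end, m.group(9)))
    return segments
```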
|
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage) |
|
|
|
|
|
--- |
|
<a name = "citation" ></a> |
|
## Citation |
|
If you use this work in your research, please cite: |
|
|
|
```bibtex |
|
@inproceedings{chunkformer, |
|
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, |
|
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
|
booktitle={ICASSP}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
--- |
|
<a name = "contact"></a> |
|
## Contact |
|
- [email protected] |
|
- [GitHub](https://github.com/khanld)

- [LinkedIn](https://www.linkedin.com/in/khanhld257/)
|
|
|
|
|
|
|
|
|
|