|
--- |
|
language: en |
|
datasets: |
|
- librispeech |
|
metrics: |
|
- wer |
|
pipeline_tag: automatic-speech-recognition |
|
tags: |
|
- transcription |
|
- audio |
|
- speech |
|
- chunkformer |
|
- asr |
|
- automatic-speech-recognition |
|
- long-form transcription |
|
- librispeech |
|
license: cc-by-nc-4.0 |
|
model-index: |
|
- name: ChunkFormer-Large-En-Libri-960h |
|
results: |
|
- task: |
|
name: Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: test-clean |
|
type: librispeech |
|
args: en |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 2.69 |
|
- task: |
|
name: Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: test-other |
|
type: librispeech |
|
args: en |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 6.91 |
|
--- |
|
|
|
# **ChunkFormer-Large-En-Libri-960h: Pretrained ChunkFormer-Large on 960 hours of the LibriSpeech dataset**
|
<style> |
|
img { |
|
display: inline; |
|
} |
|
</style> |
|
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

[GitHub](https://github.com/khanld/chunkformer)

[Paper](https://arxiv.org/abs/2502.14673)

[Model Description](#description)
|
|
|
**!!! ATTENTION: Input audio must be MONO (single channel) with a 16,000 Hz sample rate**
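A quick standard-library check can catch incompatible files before decoding. The helper below is a hypothetical sketch, not part of the repository; the `ffmpeg` command in the error message is one common way to convert:

```python
# Verify that a WAV file meets the model's input requirements
# (mono, 16,000 Hz) before passing it to the decoder.
import wave


def check_wav(path):
    """Return (channels, sample_rate); raise ValueError if incompatible."""
    with wave.open(path, "rb") as f:
        channels = f.getnchannels()
        rate = f.getframerate()
    if channels != 1 or rate != 16000:
        raise ValueError(
            f"Expected mono 16 kHz audio, got {channels} channel(s) at {rate} Hz. "
            "Convert first, e.g.: ffmpeg -i in.wav -ac 1 -ar 16000 out.wav"
        )
    return channels, rate
```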
|
--- |
|
## Table of contents |
|
1. [Model Description](#description) |
|
2. [Documentation and Implementation](#implementation) |
|
3. [Benchmark Results](#benchmark) |
|
4. [Usage](#usage) |
|
5. [Citation](#citation)

6. [Contact](#contact)
|
|
|
--- |
|
<a name = "description" ></a> |
|
## Model Description |
|
**ChunkFormer-Large-En-Libri-960h** is an English Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.
|
|
|
--- |
|
<a name = "implementation" ></a> |
|
## Documentation and Implementation |
|
The documentation and implementation of ChunkFormer are publicly available in the [GitHub repository](https://github.com/khanld/chunkformer).
|
|
|
--- |
|
<a name = "benchmark" ></a> |
|
## Benchmark Results |
|
We evaluate the models using **Word Error Rate (WER)**; lower is better. To ensure a fair comparison, all models are trained exclusively with the [**WeNet**](https://github.com/wenet-e2e/wenet) framework.
|
|
|
| No. | Model | Test-Clean | Test-Other | Avg. |
|
|-----|-----------------------|------------|------------|------ | |
|
| 1 | **ChunkFormer** | 2.69 | 6.91 | 4.80 | |
|
| 2 | **Efficient Conformer** | 2.71 | 6.95 | 4.83 | |
|
| 3 | **Conformer** | 2.77 | 6.93 | 4.85 | |
|
| 4 | **Squeezeformer** | 2.87 | 7.16 | 5.02 | |
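For reference, WER is the word-level edit distance between reference and hypothesis transcripts divided by the number of reference words. A minimal illustrative implementation is sketched below; real evaluations typically use a library such as `jiwer`:

```python
# Minimal word-error-rate (WER) sketch via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```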
|
|
|
|
|
|
|
--- |
|
<a name = "usage" ></a> |
|
## Quick Usage |
|
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps: |
|
|
|
1. **Download the ChunkFormer Repository** |
|
```bash |
|
git clone https://github.com/khanld/chunkformer.git |
|
cd chunkformer |
|
pip install -r requirements.txt |
|
``` |
|
2. **Download the Model Checkpoint from Hugging Face** |
|
```bash |
|
pip install huggingface_hub |
|
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h" |
|
``` |
|
or |
|
```bash |
|
git lfs install |
|
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h |
|
``` |
|
This downloads the model checkpoint into the `chunkformer-large-en-libri-960h` folder inside your `chunkformer` directory.
|
|
|
3. **Run the model** |
|
```bash |
|
# total_batch_duration is in seconds; the default is 1800
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
|
``` |
|
Example Output: |
|
``` |
|
[00:00:01.200] - [00:00:02.400]: this is a transcription example |
|
[00:00:02.500] - [00:00:03.700]: testing the long-form audio |
|
``` |
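If you want to post-process the transcript, the timestamped lines above can be parsed into `(start, end, text)` tuples. This is an illustrative sketch that assumes the exact output format shown in the example:

```python
# Parse decoder output lines of the form
# "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" into (start, end, text) tuples,
# with start/end in seconds.
import re

LINE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\] - "
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]: (.*)"
)


def parse_segments(lines):
    segments = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip non-segment lines (logs, blank lines, ...)
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups()[:8])
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        segments.append((start, end, m.group(9)))
    return segments
```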
|
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage) |
|
|
|
|
|
--- |
|
<a name = "citation" ></a> |
|
## Citation |
|
If you use this work in your research, please cite: |
|
|
|
```bibtex |
|
@inproceedings{chunkformer, |
|
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, |
|
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
|
booktitle={ICASSP}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
--- |
|
<a name = "contact"></a> |
|
## Contact |
|
- [email protected] |
|
- [GitHub](https://github.com/khanld)

- [LinkedIn](https://www.linkedin.com/in/khanhld257/)
|
|
|
|
|
|
|
|
|
|