---
license: cc-by-4.0
language:
- en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: Quantum_STT_V2.0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.15
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.74
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.69
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.17
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.38
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 5.95
metrics:
- wer
base_model:
- Quantamhash/Quantum_STT
---
<div align="center">
  <img src="https://huggingface.co/datasets/Quantamhash/Assets/resolve/main/images/dark_logo.png"
       alt="Title card"
       style="width: 500px; height: auto; object-position: center top;">
</div>

# **Quantum_STT_V2.0**

<style>
img {
  display: inline;
}
</style>

## <span style="color:#466f00;">Description:</span> |
|
|
|
`Quantum_STT_V2.0` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try Demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0 |
|
|
|
This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

**Key Features**
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

This model is ready for commercial/non-commercial use.

## <span style="color:#466f00;">License/Terms of Use:</span> |
|
|
|
GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license. |
|
|
|
|
|
### <span style="color:#466f00;">Deployment Geography:</span> |
|
Global |
|
|
|
|
|
### <span style="color:#466f00;">Use Case:</span> |
|
|
|
This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms. |
|
|
|
|
|
### <span style="color:#466f00;">Release Date:</span> |
|
|
|
14/05/2025 |
|
|
|
### <span style="color:#466f00;">Model Architecture:</span> |
|
|
|
**Architecture Type**: |
|
|
|
FastConformer-TDT |
|
|
|
**Network Architecture**: |
|
|
|
* This model was developed based on [FastConformer encoder](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) architecture[1] and TDT decoder[2] |
|
* This model has 600 million model parameters. |
|
|
|
### <span style="color:#466f00;">Input:</span> |
|
- **Input Type(s):** 16kHz Audio |
|
- **Input Format(s):** `.wav` and `.flac` audio formats |
|
- **Input Parameters:** 1D (audio signal) |
|
- **Other Properties Related to Input:** Monochannel audio |
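If a recording is not already 16 kHz mono, it can be converted before transcription. Below is a minimal sketch assuming the third-party `librosa` and `soundfile` packages are installed (they are not mentioned elsewhere in this card, and the file names are hypothetical placeholders):

```python
import librosa
import soundfile as sf

def to_16k_mono(src_path: str, dst_path: str) -> None:
    # librosa.load resamples to 16 kHz and downmixes to mono in one call
    audio, sr = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, sr)

# Hypothetical file names, for illustration only
to_16k_mono("meeting.mp3", "meeting_16k_mono.wav")
```
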
### <span style="color:#466f00;">Output:</span> |
|
- **Output Type(s):** Text |
|
- **Output Format:** String |
|
- **Output Parameters:** 1D (text) |
|
- **Other Properties Related to Output:** Punctuations and Capitalizations included. |
|
|
|
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
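As a sketch of explicit device placement: the loaded model is a standard PyTorch module, so it can be moved to a GPU manually (`asr_model` here refers to the instance loaded in the usage section below):

```python
import torch

# Use a GPU when one is available; transcription also works on CPU, just slower.
device = "cuda" if torch.cuda.is_available() else "cpu"
asr_model = asr_model.to(device)
```
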
## <span style="color:#466f00;">How to Use this Model:</span> |
|
|
|
To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version. |
|
```bash |
|
pip install -U nemo_toolkit["asr"] |
|
``` |
|
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. |
|
|
|
#### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

# Downloads the checkpoint from the Hugging Face Hub on first use
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```

#### Transcribing using Python

First, let's get a sample:
```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
# transcribe() takes a list of audio paths and returns one result per file
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
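For multiple files, pass a longer list in a single call. As a sketch (the file names are hypothetical), a `batch_size` argument can be supplied to batch the inputs:

```python
# A sketch with hypothetical file names; batch_size trades GPU memory for throughput.
files = ['call_1.wav', 'call_2.wav', 'call_3.wav']
outputs = asr_model.transcribe(files, batch_size=4)
for path, hyp in zip(files, outputs):
    print(f"{path}: {hyp.text}")
```
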
#### Transcribing with timestamps

To transcribe with timestamps:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
# by default, timestamps are enabled at the char, word, and segment level
word_timestamps = output[0].timestamp['word']        # word-level timestamps for the first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```
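Since subtitle generation is among the intended use cases, here is a minimal sketch (illustrative only, not part of the NeMo API) that writes the `segment_timestamps` from the snippet above to an SRT file:

```python
# Convert seconds to the SRT "HH:MM:SS,mmm" time format
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

# segment_timestamps comes from the timestamp example above
with open("sample.srt", "w") as f:
    for i, stamp in enumerate(segment_timestamps, start=1):
        f.write(f"{i}\n")
        f.write(f"{to_srt_time(stamp['start'])} --> {to_srt_time(stamp['end'])}\n")
        f.write(f"{stamp['segment']}\n\n")
```
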
## <span style="color:#466f00;">Software Integration:</span> |
|
|
|
**Runtime Engine(s):** |
|
* NeMo 2.2 |
|
|
|
|
|
**[Preferred/Supported] Operating System(s):** |
|
|
|
- Linux |
|
|
|
**Hardware Specific Requirements:**

At least 2 GB of RAM is required to load the model; the more RAM available, the longer the audio inputs that can be processed.

#### Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed [here](https://huggingface.co/Quantamhash/Quantum_STT).
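To load the previous version instead, the same API applies (a sketch; `Quantamhash/Quantum_STT` is the base model listed in this card's metadata):

```python
import nemo.collections.asr as nemo_asr

# Load the earlier checkpoint from the Hub
prev_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT")
```
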
## <span style="color:#466f00;">Performance</span> |
|
|
|
#### Huggingface Open-ASR-Leaderboard Performance |
|
The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio. |
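For reference, WER is the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A quick sketch using the third-party `jiwer` package (an assumption; not necessarily the tooling behind the numbers below):

```python
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
# 1 substitution out of 6 reference words -> WER of about 0.167
print(jiwer.wer(reference, hypothesis))
```
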
### Base Performance
The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| **Model** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** |
|:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

### Noise Robustness
Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| **SNR Level** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |

### Telephony Audio Performance
Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| **Audio Format** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.