---
license: cc-by-4.0
language:
- en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: Quantum_STT_V2.0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.15
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.74
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.69
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.17
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.38
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 5.95
metrics:
- wer
base_model:
- Quantamhash/Quantum_STT
---
<div align="center">
  <img src="https://huggingface.co/datasets/Quantamhash/Assets/resolve/main/images/dark_logo.png"
       alt="Title card"
       style="width: 500px; height: auto; object-position: center top;">
</div>

# **Quantum_STT_V2.0**

<style>
img {
  display: inline;
}
</style>

## <span style="color:#466f00;">Description:</span> |
|
|
|
`Quantum_STT_V2.0` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try Demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0 |
|
|
|
This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

**Key Features**
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

This model is ready for commercial/non-commercial use.

## <span style="color:#466f00;">License/Terms of Use:</span> |
|
|
|
GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license. |
|
|
|
|
|
### <span style="color:#466f00;">Deployment Geography:</span> |
|
Global |
|
|
|
|
|
### <span style="color:#466f00;">Use Case:</span> |
|
|
|
This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms. |
|
|
|
|
|
### <span style="color:#466f00;">Release Date:</span> |
|
|
|
14/05/2025 |
|
|
|
### <span style="color:#466f00;">Model Architecture:</span> |
|
|
|
**Architecture Type**: |
|
|
|
FastConformer-TDT |
|
|
|
**Network Architecture**: |
|
|
|
* This model was developed based on [FastConformer encoder](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) architecture[1] and TDT decoder[2] |
|
* This model has 600 million model parameters. |
|
|
|
### <span style="color:#466f00;">Input:</span> |
|
- **Input Type(s):** 16kHz Audio |
|
- **Input Format(s):** `.wav` and `.flac` audio formats |
|
- **Input Parameters:** 1D (audio signal) |
|
- **Other Properties Related to Input:** Monochannel audio |
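If a recording is not already 16 kHz mono, it can be converted before transcription. Below is a minimal sketch assuming the third-party `librosa` and `soundfile` packages are installed (they are not mentioned elsewhere in this card, and the file names are hypothetical placeholders):

```python
import librosa
import soundfile as sf

def to_16k_mono(src_path: str, dst_path: str) -> None:
    # librosa.load resamples to 16 kHz and downmixes to mono in one call
    audio, sr = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, sr)

# Hypothetical file names, for illustration only
to_16k_mono("meeting.mp3", "meeting_16k_mono.wav")
```
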
### <span style="color:#466f00;">Output:</span> |
|
- **Output Type(s):** Text |
|
- **Output Format:** String |
|
- **Output Parameters:** 1D (text) |
|
- **Other Properties Related to Output:** Punctuations and Capitalizations included. |
|
|
|
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
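As a sketch of explicit device placement: the loaded model is a standard PyTorch module, so it can be moved to a GPU manually (`asr_model` here refers to the instance loaded in the usage section below):

```python
import torch

# Use a GPU when one is available; transcription also works on CPU, just slower.
device = "cuda" if torch.cuda.is_available() else "cpu"
asr_model = asr_model.to(device)
```
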
## <span style="color:#466f00;">How to Use this Model:</span> |
|
|
|
To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version. |
|
```bash |
|
pip install -U nemo_toolkit["asr"] |
|
``` |
|
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. |
|
|
|
#### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

# Downloads the checkpoint from the Hugging Face Hub on first use
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```

#### Transcribing using Python

First, let's get a sample:
```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
# transcribe() takes a list of audio paths and returns one result per file
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
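For multiple files, pass a longer list in a single call. As a sketch (the file names are hypothetical), a `batch_size` argument can be supplied to batch the inputs:

```python
# A sketch with hypothetical file names; batch_size trades GPU memory for throughput.
files = ['call_1.wav', 'call_2.wav', 'call_3.wav']
outputs = asr_model.transcribe(files, batch_size=4)
for path, hyp in zip(files, outputs):
    print(f"{path}: {hyp.text}")
```
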
#### Transcribing with timestamps

To transcribe with timestamps:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
# by default, timestamps are enabled at the char, word, and segment level
word_timestamps = output[0].timestamp['word']        # word-level timestamps for the first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```
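Since subtitle generation is among the intended use cases, here is a minimal sketch (illustrative only, not part of the NeMo API) that writes the `segment_timestamps` from the snippet above to an SRT file:

```python
# Convert seconds to the SRT "HH:MM:SS,mmm" time format
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

# segment_timestamps comes from the timestamp example above
with open("sample.srt", "w") as f:
    for i, stamp in enumerate(segment_timestamps, start=1):
        f.write(f"{i}\n")
        f.write(f"{to_srt_time(stamp['start'])} --> {to_srt_time(stamp['end'])}\n")
        f.write(f"{stamp['segment']}\n\n")
```
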
## <span style="color:#466f00;">Software Integration:</span> |
|
|
|
**Runtime Engine(s):** |
|
* NeMo 2.2 |
|
|
|
|
|
**[Preferred/Supported] Operating System(s):** |
|
|
|
- Linux |
|
|
|
**Hardware Specific Requirements:**

At least 2 GB of RAM is required to load the model; the more RAM available, the longer the audio inputs that can be processed.

#### Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed [here](https://huggingface.co/Quantamhash/Quantum_STT).
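To load the previous version instead, the same API applies (a sketch; `Quantamhash/Quantum_STT` is the base model listed in this card's metadata):

```python
import nemo.collections.asr as nemo_asr

# Load the earlier checkpoint from the Hub
prev_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT")
```
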
## <span style="color:#466f00;">Performance</span> |
|
|
|
#### Huggingface Open-ASR-Leaderboard Performance |
|
The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio. |
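For reference, WER is the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A quick sketch using the third-party `jiwer` package (an assumption; not necessarily the tooling behind the numbers below):

```python
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
# 1 substitution out of 6 reference words -> WER of about 0.167
print(jiwer.wer(reference, hypothesis))
```
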
### Base Performance
The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| **Model** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** |
|:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

### Noise Robustness
Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| **SNR Level** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |

### Telephony Audio Performance
Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| **Audio Format** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.