|
from pathlib import Path |
|
|
|
|
|
DIR_OUTPUT_REQUESTS = Path("requested_models") |
|
EVAL_REQUESTS_PATH = Path("eval_requests") |
|
|
|
|
|
|
|
|
|
|
|
banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png" |
|
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>' |
|
|
|
EXPLANATION = """ |
|
### How to Read the Results |
|
* **Average WER ⬇️**: Lower Word Error Rate (WER) is better |
|
* **RTFx ⬆️**: Inverse Real-Time Factor - higher means faster processing
|
|
|
Use the column filter to focus on specific demographics or view all results together. |
|
""" |
|
|
|
EXPLANATION_EDACC = """ |
|
## EdAcc: Evaluating ASR Models Across Global English Accents |
|
|
|
The [Edinburgh International Accents of English Corpus (EdAcc)](https://huggingface.co/datasets/edinburghcstr/edacc) features over 40 distinct English accents from both native (L1) and non-native (L2) speakers. This evaluation helps you: |
|
|
|
* **Compare Gender Performance**: Analyze how models perform across male and female speakers |
|
* **Evaluate Regional Robustness**: Test model accuracy across European, Asian, African, and American accents |
|
* **Assess Real-World Applicability**: Understand performance in natural conversational settings |
|
|
|
The results show that: |
|
* Larger models consistently outperform their smaller counterparts |
|
* Multilingual models often handle accent diversity better than English-only variants |
|
* Distilled models maintain good performance but show slight degradation on challenging accents |
|
""" |
|
|
|
EXPLANATION_AFRI = """ |
|
## AfriSpeech: Testing ASR Robustness on African English Accents |
|
|
|
The [AfriSpeech](https://huggingface.co/datasets/intronhealth/afrispeech-200) Out-of-Distribution (OOD) test set features 20 distinct African English accents not present in common training data. This benchmark: |
|
|
|
* **Challenges Model Generalization**: Tests performance on truly underrepresented accents |
|
* **Reveals Robustness Gaps**: Highlights limitations in current ASR systems |
|
* **Guides Improvement**: Identifies areas needing focused development |
|
|
|
Key findings show: |
|
* Full-sized models significantly outperform distilled versions |
|
* Multilingual models demonstrate better generalization to African accents |
|
* Even top performers show room for improvement on these challenging accents |
|
""" |
|
|
|
TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </h1> </body> </html>"
|
|
|
INTRODUCTION_TEXT = (
    "📐 Results on the [EdAcc Dataset](https://huggingface.co/datasets/edinburghcstr/edacc), split by accent and gender."
    "\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better)."
)
|
|
|
|
|
CITATION_TEXT = """@misc{open-asr-leaderboard, |
|
title = {Open Automatic Speech Recognition Leaderboard}, |
|
author = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and others}, |
|
year = 2023, |
|
publisher = {Hugging Face}, |
|
howpublished = "\\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}" |
|
} |
|
""" |
|
|
|
METRICS_TAB_TEXT = """ |
|
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard. |
|
|
|
## Metrics |
|
|
|
Models are evaluated jointly using the Word Error Rate (WER) and Inverse Real Time Factor (RTFx) metrics. WER measures
the transcription accuracy of a system, and RTFx its inference speed. Models are ranked in the leaderboard by their WER,
from lowest to highest.
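
For example, the ranking rule itself is just an ascending sort on WER (the model names and scores below are placeholders, not real leaderboard entries):

```python
# A tiny sketch of the ranking rule: sort (model, WER) pairs by WER, ascending.
results = [("model-a", 12.3), ("model-b", 8.7), ("model-c", 10.1)]  # placeholder entries
leaderboard = sorted(results, key=lambda entry: entry[1])  # best (lowest WER) first
```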
|
|
|
Crucially, the WER and RTFx values are computed for the same inference run using a single script. The implication of this is twofold:
1. The WER and RTFx values are coupled: for a given WER, one can expect to achieve the corresponding RTFx. This allows users to trade off a lower WER against a higher RTFx should they wish.
2. The WER and RTFx values are averaged over all audio samples in the benchmark (on the order of thousands of samples).
|
|
|
For details on reproducing the benchmark numbers, refer to the [Open ASR GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model). |
|
|
|
### Word Error Rate (WER) |
|
|
|
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage |
|
of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**. |
|
|
|
Take the following example: |
|
|
|
| Reference: | the | cat | sat | on | the | mat | |
|
|-------------|-----|-----|---------|-----|-----|-----| |
|
| Prediction: | the | cat | **sit** | on | the | | | |
|
| Label: | ✅ | ✅ | S | ✅ | ✅ | D | |
|
|
|
Here, we have: |
|
* 1 substitution ("sit" instead of "sat") |
|
* 0 insertions |
|
* 1 deletion ("mat" is missing) |
|
|
|
This gives 2 errors in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our |
|
reference (N), which for this example is 6: |
|
|
|
``` |
|
WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333 |
|
``` |
|
|
|
This gives a WER of 0.33, or 33%. For a fair comparison, we calculate the **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
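
As a minimal sketch of this computation using the Hugging Face `evaluate` library (the WER metric linked above), where the toy normaliser is an illustrative stand-in rather than the leaderboard's full normaliser:

```python
# A minimal, illustrative normalised-WER computation.
# The toy normaliser lower-cases text and strips punctuation; it stands in
# for the full normaliser used by the leaderboard.
import re
import evaluate

def normalise(text):
    text = text.lower()
    return re.sub(r"[^a-z0-9' ]+", " ", text).strip()

wer_metric = evaluate.load("wer")
references = ["The cat sat on the mat."]
predictions = ["The cat sit on the"]

wer = wer_metric.compute(
    predictions=[normalise(p) for p in predictions],
    references=[normalise(r) for r in references],
)
print(round(wer, 3))  # 2 errors / 6 reference words = 0.333
```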
|
|
|
### Inverse Real Time Factor (RTFx) |
|
|
|
Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is defined as:
|
``` |
|
RTFx = (number of seconds of audio inferred) / (compute time in seconds) |
|
``` |
|
|
|
Therefore, an RTFx of 1 means a system processes speech as fast as it is spoken, while an RTFx of 2 means it takes half the time.
|
Thus, **a higher RTFx value indicates lower latency**. |
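
Since WER and RTFx come from the same inference run, the measurement boils down to timing the transcription loop itself. A rough sketch (the `transcribe` function and its inputs are placeholders, not the leaderboard's actual harness):

```python
# A rough sketch: time one inference pass and derive RTFx from it.
# `transcribe`, `audio_paths` and `audio_durations_s` are placeholders for
# whatever ASR system and dataset metadata are being benchmarked.
import time

def benchmark_rtfx(transcribe, audio_paths, audio_durations_s):
    start = time.perf_counter()
    transcripts = [transcribe(path) for path in audio_paths]
    compute_time_s = time.perf_counter() - start
    rtfx = sum(audio_durations_s) / compute_time_s
    return transcripts, rtfx
```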
|
|
|
## How to reproduce our results |
|
|
|
The ASR Leaderboard is an ongoing effort to benchmark open-source and open-access speech recognition models where possible.
Along with the leaderboard, we're open-sourcing the codebase used for running these evaluations.
For more details, head over to our repo at: https://github.com/huggingface/open_asr_leaderboard
|
|
|
P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️ |
|
|
|
## Benchmark datasets |
|
|
|
Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the |
|
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model. |
|
|
|
ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad |
|
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains, |
|
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how |
|
a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone. |
|
|
|
The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard |
|
are ranked based on their average WER scores, from lowest to highest. |
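
Concretely, the macro-average is just the unweighted mean of the per-dataset WERs, for example (placeholder numbers, not leaderboard results):

```python
# A minimal sketch: the ESB score as an unweighted mean of per-dataset WERs.
wers = {"librispeech": 3.2, "voxpopuli": 8.1, "ami": 15.7}  # placeholder values
esb_score = sum(wers.values()) / len(wers)  # macro-average WER
```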
|
|
|
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License | |
|
|-----------------------------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------| |
|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 | |
|
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 | |
|
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 | |
|
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 | |
|
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement | |
|
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22) | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 | |
|
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 | |
|
|
|
For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352). |
|
""" |
|
|
|
LEADERBOARD_CSS = """ |
|
#leaderboard-table th .header-content { |
|
white-space: nowrap; |
|
} |
|
""" |
|
|