Update README.md
README.md
CHANGED
|
@@ -1,3 +1,22 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- k2-fsa/TTS_eval_datasets
|
| 5 |
+
language:
|
| 6 |
+
- en
|
| 7 |
+
- zh
|
| 8 |
+
pipeline_tag: text-to-speech
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
This repository collects the models used for objective evaluation of text-to-speech (TTS) systems:
|
| 12 |
+
|
| 13 |
+
- **WER**: Includes a [HuBERT-based ASR model](https://huggingface.co/facebook/hubert-large-ls960-ft) for the LibriSpeech-PC test set, a [Paraformer-based ASR model](https://huggingface.co/funasr/paraformer-zh) for Chinese datasets, the [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for general English test sets, and the [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model for English dialogue speech.
|
| 14 |
+
|
| 15 |
+
- **cpWER**: The [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model is used to compute the concatenated minimum-permutation word error rate
|
| 16 |
+
([cpWER](https://arxiv.org/abs/2507.09318)) for English dialogue speech.
|
| 17 |
+
|
| 18 |
+
- **SIM-o**: A [WavLM-based speaker verification model](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) is used to compute the speaker similarity between the prompt and the generated speech.
|
| 19 |
+
|
| 20 |
+
- **cpSIM**: A [speaker diarization model](https://huggingface.co/pyannote/speaker-diarization-3.1) is used together with the WavLM-based model above to compute the concatenated maximum-permutation speaker similarity ([cpSIM](https://arxiv.org/abs/2507.09318)).
|
| 21 |
+
|
| 22 |
+
- **UTMOS**: The MOS prediction model [UTMOS](https://github.com/sarulab-speech/UTMOS22) is used.
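
To make the WER and cpWER items above concrete, here is a minimal pure-Python sketch of how the two scores are commonly defined. It is not the evaluation code that accompanies these models: the function names (`edit_distance`, `wer`, `cp_wer`) are hypothetical, the transcripts are assumed to be already produced by an ASR model and already text-normalized, and the brute-force search over speaker permutations is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """WER = word-level edit distance / number of reference words."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """cpWER sketch: each list holds one concatenated transcript per speaker.

    Every assignment of hypothesis speakers to reference speakers is tried,
    and the assignment with the fewest total word errors is kept.
    """
    refs = [text.split() for text in ref_by_speaker]
    hyps = [text.split() for text in hyp_by_speaker]
    # Pad with empty transcripts so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

Because `cp_wer` minimizes over speaker assignments, a hypothesis whose speaker labels are merely swapped relative to the reference is not penalized, which is the property that makes the metric suitable for dialogue speech.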
|
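
The SIM-o and cpSIM items in the list above both reduce to comparing speaker embeddings. The following is a rough sketch under stated assumptions, not the actual evaluation pipeline: embeddings are taken as plain lists of floats that some speaker verification model has already produced, the generated dialogue is assumed to be already diarized into one embedding per speaker, and averaging the per-speaker similarities before taking the maximum over permutations is an assumption made here for illustration. The names `cosine_similarity` and `cp_sim` are hypothetical.

```python
import math
from itertools import permutations


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cp_sim(prompt_embs, generated_embs):
    """cpSIM sketch: best mean similarity over all speaker assignments.

    prompt_embs:    one embedding per prompt speaker
    generated_embs: one embedding per diarized speaker in the generated audio
    Assumes both lists contain the same number of speakers.
    """
    if len(prompt_embs) != len(generated_embs):
        raise ValueError("this sketch assumes equal speaker counts")
    n = len(prompt_embs)
    return max(
        sum(cosine_similarity(p, g) for p, g in zip(prompt_embs, perm)) / n
        for perm in permutations(generated_embs)
    )
```

For a single-speaker utterance, SIM-o corresponds to one call to `cosine_similarity` on the prompt embedding and the generated-speech embedding; `cp_sim` extends this to dialogue, where the diarizer's speaker labels carry no fixed order.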