Commit ed701ab (verified) by zhu-han · 1 parent: e248700

Update README.md

Files changed (1): README.md +22 -3
@@ -1,3 +1,22 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - k2-fsa/TTS_eval_datasets
+ language:
+ - en
+ - zh
+ pipeline_tag: text-to-speech
+ ---
+
+ This repository contains various models for the objective evaluation of text-to-speech (TTS) systems:
+
+ - **WER**: Includes a [HuBERT-based ASR model](https://huggingface.co/facebook/hubert-large-ls960-ft) for the LibriSpeech-PC test set, a [Paraformer-based ASR model](https://huggingface.co/funasr/paraformer-zh) for Chinese datasets, the [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for general English test sets, and the [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model for English dialogue speech.
+
+ - **cpWER**: The [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model is used to compute the concatenated minimum-permutation word error rate ([cpWER](https://arxiv.org/abs/2507.09318)) for English dialogue speech.
+
+ - **SIM-o**: A [WavLM-based speaker verification model](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) is used to compute the speaker similarity between the prompt and the generated speech.
+
+ - **cpSIM**: A [speaker diarization model](https://huggingface.co/pyannote/speaker-diarization-3.1) is used together with the WavLM-based model above to compute the concatenated maximum-permutation speaker similarity ([cpSIM](https://arxiv.org/abs/2507.09318)).
+
+ - **UTMOS**: The MOS prediction model [UTMOS](https://github.com/sarulab-speech/UTMOS22) is used to predict mean opinion scores (MOS).
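
The README names the metrics but not how the scores are computed once the ASR transcripts and speaker embeddings are available. The following is a minimal pure-Python sketch of that last step, not the repository's evaluation code. All function names (`edit_distance`, `wer`, `cpwer`, `cosine`, `cpsim`) are hypothetical. It assumes transcripts are already normalized and whitespace-tokenized, that reference and hypothesis have the same number of speakers, and that speaker embeddings are plain non-zero lists of floats. The `cpsim` function is one plausible reading of "concatenated maximum-permutation speaker similarity"; the linked paper is the authority on the exact definition.

```python
from itertools import permutations
from math import sqrt


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = ref_text.split(), hyp_text.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def cpwer(ref_by_speaker, hyp_by_speaker):
    """cpWER sketch: each list holds one concatenated transcript per speaker.

    Tries every assignment of hypothesis speakers to reference speakers
    and keeps the one with the fewest total word errors.
    Assumption: both lists have the same number of speakers.
    """
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcripts are empty")
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def cpsim(prompt_embs, gen_embs):
    """cpSIM sketch (assumed definition): best mean cosine similarity
    over all assignments of generated speakers to prompt speakers."""
    return max(
        sum(cosine(p, g) for p, g in zip(prompt_embs, perm)) / len(prompt_embs)
        for perm in permutations(gen_embs)
    )
```

In this sketch, SIM-o corresponds to a single `cosine` call between the prompt embedding and the generated-speech embedding. The permutation search grows factorially with the number of speakers, which is acceptable for two-speaker dialogues but would need an assignment solver for larger groups.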