---
license: apache-2.0
datasets:
- k2-fsa/TTS_eval_datasets
language:
- en
- zh
pipeline_tag: text-to-speech
---

This repository contains the models used for objective evaluation of text-to-speech (TTS) systems:

- **WER**: Includes a [HuBERT-based ASR model](https://huggingface.co/facebook/hubert-large-ls960-ft) for the LibriSpeech-PC test set, a [Paraformer-based ASR model](https://huggingface.co/funasr/paraformer-zh) for Chinese datasets, the [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for general English test sets, and the [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model for English dialogue speech.

- **cpWER**: The [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model is used to compute the concatenated minimum-permutation word error rate ([cpWER](https://arxiv.org/abs/2507.09318)) for English dialogue speech.

- **SIM-o**: A [WavLM-based speaker verification model](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) is used to compute the speaker similarity between the prompt and the generated speech.

- **cpSIM**: A [speaker diarization model](https://huggingface.co/pyannote/speaker-diarization-3.1) is used together with the WavLM-based model above to compute the concatenated maximum-permutation speaker similarity ([cpSIM](https://arxiv.org/abs/2507.09318)).

- **UTMOS**: The MOS (mean opinion score) prediction model [UTMOS](https://github.com/sarulab-speech/UTMOS22) is used.
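
As a rough illustration of the WER and cpWER metrics listed above, the sketch below computes both from plain transcripts. It is not the evaluation code used with these models: the function names (`edit_distance`, `wer`, `cp_wer`) are hypothetical, text normalization is omitted, and it assumes the transcripts have already been produced by an ASR model and grouped per speaker, with the same number of speakers on both sides and non-empty references.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate: edit distance divided by the reference length."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER (illustrative sketch).

    Each argument is a list with one concatenated transcript per speaker.
    Speaker labels in the hypothesis are arbitrary, so every assignment of
    hypothesis speakers to reference speakers is tried and the one with the
    fewest total errors is kept.
    """
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(cp_wer(["hello there", "good morning everyone"],
                 ["good morning everyone", "hello there"]))
```

Trying every permutation is only practical for a small number of speakers, which fits two-party dialogue; for many speakers an assignment solver would be the usual substitute.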


For details of the evaluation metrics, see [ZipVoice](https://arxiv.org/abs/2506.13053) and [ZipVoice-Dialog](https://arxiv.org/abs/2507.09318).
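
The speaker-similarity metrics above (SIM-o and cpSIM) can be sketched in the same spirit. The snippet below assumes that speaker embeddings have already been extracted (for example by the WavLM-based verification model) and uses toy vectors instead. The function names are hypothetical, and averaging the per-speaker similarities in `cp_sim` is an assumption; the cited papers give the exact definition.

```python
import math
from itertools import permutations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cp_sim(prompt_embs, generated_embs):
    """Maximum-permutation speaker similarity (illustrative sketch).

    prompt_embs: one embedding per prompt speaker.
    generated_embs: one embedding per diarized speaker, assumed to be
    computed from that speaker's concatenated segments.
    Diarization labels are arbitrary, so all speaker assignments are
    tried and the highest mean similarity is returned.
    """
    return max(
        sum(cosine_similarity(p, g) for p, g in zip(prompt_embs, perm))
        / len(prompt_embs)
        for perm in permutations(generated_embs)
    )


if __name__ == "__main__":
    prompts = [[1.0, 0.0], [0.0, 1.0]]
    generated = [[0.0, 2.0], [3.0, 0.0]]  # same speakers, swapped labels
    print(cosine_similarity(prompts[0], generated[0]))
    print(cp_sim(prompts, generated))
```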