# TTS Gives the Digital Human Realistic Voice Interaction

## Edge-TTS

Edge-TTS is a Python library that uses Microsoft's Azure Cognitive Services to perform text-to-speech (TTS) conversion.
It provides a simple API for converting text to speech and supports many languages and voices. To use it, first install the library with pip:
```bash
pip install -U edge-tts
```
> For more detailed usage, see [https://github.com/rany2/edge-tts](https://github.com/rany2/edge-tts)

Based on the source code, I wrote an `EdgeTTS` class that is easier to use and also saves a subtitle file, which improves the experience.
```python
import asyncio
from io import TextIOWrapper
from typing import TextIO, Union

from edge_tts import Communicate, SubMaker
# list_voices_fn is a synchronous wrapper around edge_tts's voice listing,
# returning the available voice metadata dictionaries.


class EdgeTTS:
    def __init__(self, list_voices=False, proxy=None) -> None:
        voices = list_voices_fn(proxy=proxy)
        self.SUPPORTED_VOICE = [item['ShortName'] for item in voices]
        self.SUPPORTED_VOICE.sort(reverse=True)
        if list_voices:
            print(", ".join(self.SUPPORTED_VOICE))

    def preprocess(self, rate, volume, pitch):
        """Convert numeric settings into the signed strings edge-tts expects."""
        rate = f'+{rate}%' if rate >= 0 else f'{rate}%'
        pitch = f'+{pitch}Hz' if pitch >= 0 else f'{pitch}Hz'
        # edge-tts expresses volume as attenuation below 100%.
        volume = f'-{100 - volume}%'
        return rate, volume, pitch

    def predict(self, TEXT, VOICE, RATE, VOLUME, PITCH,
                OUTPUT_FILE='result.wav', OUTPUT_SUBS='result.vtt',
                words_in_cue=8):
        async def amain() -> None:
            """Stream the synthesis, collecting word boundaries for subtitles."""
            rate, volume, pitch = self.preprocess(rate=RATE, volume=VOLUME, pitch=PITCH)
            communicate = Communicate(TEXT, VOICE, rate=rate, volume=volume, pitch=pitch)
            subs: SubMaker = SubMaker()
            sub_file: Union[TextIOWrapper, TextIO] = open(OUTPUT_SUBS, "w", encoding="utf-8")
            async for chunk in communicate.stream():
                if chunk["type"] == "WordBoundary":
                    subs.create_sub((chunk["offset"], chunk["duration"]), chunk["text"])
            sub_file.write(subs.generate_subs(words_in_cue))
            sub_file.close()
            await communicate.save(OUTPUT_FILE)

        asyncio.run(amain())
        # Strip spaces from subtitle text lines, but leave timestamp lines intact.
        with open(OUTPUT_SUBS, 'r', encoding='utf-8') as file:
            vtt_lines = file.readlines()
        vtt_lines_without_spaces = [
            line.replace(" ", "") if "-->" not in line else line for line in vtt_lines
        ]
        with open(OUTPUT_SUBS, 'w', encoding='utf-8') as output_file:
            output_file.writelines(vtt_lines_without_spaces)
        return OUTPUT_FILE, OUTPUT_SUBS
```
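The `preprocess` step encodes the numeric rate/volume/pitch settings as the signed strings edge-tts expects (`+10%`, `-20%`, `-5Hz`, …). A standalone sketch of just that conversion (a hypothetical helper, not part of edge-tts itself):

```python
def format_prosody(rate: int, volume: int, pitch: int) -> tuple:
    """Format rate/volume/pitch the way EdgeTTS.preprocess does."""
    rate_s = f'+{rate}%' if rate >= 0 else f'{rate}%'
    pitch_s = f'+{pitch}Hz' if pitch >= 0 else f'{pitch}Hz'
    # Volume is expressed as attenuation below 100%: volume=100 -> '-0%'.
    volume_s = f'-{100 - volume}%'
    return rate_s, volume_s, pitch_s

print(format_prosody(10, 100, -5))   # ('+10%', '-0%', '-5Hz')
```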
I also wrote a simple `WebUI` in the `src` folder:

```bash
python app.py
```
 | |
## PaddleTTS

In practice you may need to run offline. Since Edge-TTS requires an online environment to generate speech, we chose the equally open-source PaddleSpeech as a text-to-speech (TTS) alternative. The quality may differ, but PaddleSpeech works offline. For more information, see the PaddleSpeech GitHub page: [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech).
### Vocoder notes

PaddleSpeech ships with three vocoders: PWGan, WaveRnn, and HifiGan. They differ considerably in audio quality and generation speed, so choose according to your needs. We recommend PWGan or HifiGan, because WaveRnn's generation is extremely slow.
| Vocoder | Audio quality | Generation speed |
| :-----: | :-----------: | :--------------------: |
| PWGan   | Medium        | Medium                 |
| WaveRnn | High          | Very slow (be patient) |
| HifiGan | Low           | Fast                   |
### TTS datasets

The examples in PaddleSpeech are organized mainly by dataset. The TTS datasets we mainly use are:

- CSMSC (Mandarin, single speaker)
- AISHELL3 (Mandarin, multi-speaker)
- LJSpeech (English, single speaker)
- VCTK (English, multi-speaker)
### PaddleSpeech TTS model mapping

The model identifiers in PaddleSpeech's TTS examples correspond to the following models:

- tts0 - Tacotron2
- tts1 - TransformerTTS
- tts2 - SpeedySpeech
- tts3 - FastSpeech2
- voc0 - WaveFlow
- voc1 - Parallel WaveGAN
- voc2 - MelGAN
- voc3 - MultiBand MelGAN
- voc4 - Style MelGAN
- voc5 - HiFiGAN
- vc0 - Tacotron2 Voice Clone with GE2E
- vc1 - FastSpeech2 Voice Clone with GE2E
### Pretrained model list

The following pretrained models from PaddleSpeech are available through the command line and the Python API:

#### Acoustic models

| Model                        | Language |
| :--------------------------- | :------: |
| speedyspeech_csmsc           | zh       |
| fastspeech2_csmsc            | zh       |
| fastspeech2_ljspeech         | en       |
| fastspeech2_aishell3         | zh       |
| fastspeech2_vctk             | en       |
| fastspeech2_cnndecoder_csmsc | zh       |
| fastspeech2_mix              | mix      |
| tacotron2_csmsc              | zh       |
| tacotron2_ljspeech           | en       |
| fastspeech2_male             | zh       |
| fastspeech2_male             | en       |
| fastspeech2_male             | mix      |
| fastspeech2_canton           | canton   |
#### Vocoders

| Model              | Language |
| :----------------- | :------: |
| pwgan_csmsc        | zh       |
| pwgan_ljspeech     | en       |
| pwgan_aishell3     | zh       |
| pwgan_vctk         | en       |
| mb_melgan_csmsc    | zh       |
| style_melgan_csmsc | zh       |
| hifigan_csmsc      | zh       |
| hifigan_ljspeech   | en       |
| hifigan_aishell3   | zh       |
| hifigan_vctk       | en       |
| wavernn_csmsc      | zh       |
| pwgan_male         | zh       |
| hifigan_male       | zh       |
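The pretrained model names above follow an `{architecture}_{dataset}` convention. A small illustration of how such names compose (a hypothetical helper, not part of PaddleSpeech):

```python
def model_name(arch: str, dataset: str) -> str:
    """Compose a PaddleSpeech-style pretrained model name: {arch}_{dataset}."""
    return f"{arch.lower()}_{dataset.lower()}"

print(model_name("FastSpeech2", "CSMSC"))  # fastspeech2_csmsc
print(model_name("HifiGan", "AISHELL3"))   # hifigan_aishell3
```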
Building on PaddleSpeech, I wrote a `PaddleTTS` class that makes it easier to use and to get results:
```python
import os

from paddlespeech.cli.tts.infer import TTSExecutor


class PaddleTTS:
    def __init__(self) -> None:
        pass

    def predict(self, text, am, voc, spk_id=174, lang='zh', male=False, save_path='output.wav'):
        self.tts = TTSExecutor()
        use_onnx = True
        voc = voc.lower()
        am = am.lower()
        if male:
            assert voc in ["pwgan", "hifigan"], "male voc must be 'pwgan' or 'hifigan'"
            wav_file = self.tts(
                text=text,
                output=save_path,
                am='fastspeech2_male',
                voc=voc + '_male',
                lang=lang,
                use_onnx=use_onnx
            )
            return wav_file
        assert am in ['tacotron2', 'fastspeech2'], "am must be 'tacotron2' or 'fastspeech2'"
        # Mixed Chinese/English synthesis
        if lang == 'mix':
            # Only fastspeech2 supports mixed-language synthesis
            am = 'fastspeech2_mix'
            voc += '_csmsc'
        # English synthesis
        elif lang == 'en':
            am += '_ljspeech'
            voc += '_ljspeech'
        # Chinese synthesis
        elif lang == 'zh':
            assert voc in ['wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'], \
                "voc must be one of 'wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'"
            am += '_csmsc'
            voc += '_csmsc'
        # Cantonese synthesis
        elif lang == 'canton':
            am = 'fastspeech2_canton'
            voc = 'pwgan_aishell3'
            spk_id = 10
        print("am:", am, "voc:", voc, "lang:", lang, "male:", male, "spk_id:", spk_id)
        try:
            # Prefer the CLI; fall back to the Python API if it fails.
            cmd = (f'paddlespeech tts --am {am} --voc {voc} --input "{text}" '
                   f'--output {save_path} --lang {lang} --spk_id {spk_id} --use_onnx {use_onnx}')
            os.system(cmd)
            wav_file = save_path
        except Exception:
            # Speech synthesis via the Python API
            wav_file = self.tts(
                text=text,
                output=save_path,
                am=am,
                voc=voc,
                lang=lang,
                spk_id=spk_id,
                use_onnx=use_onnx
            )
        return wav_file
```
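The am/voc name resolution inside `predict` can be isolated as a pure function, which makes it easy to check without installing PaddleSpeech. This is a sketch mirroring the branching logic above, not part of the library:

```python
def resolve_models(am: str, voc: str, lang: str, spk_id: int = 174):
    """Mirror PaddleTTS.predict's model-name resolution for a given language."""
    am, voc = am.lower(), voc.lower()
    if lang == 'mix':
        am = 'fastspeech2_mix'   # only fastspeech2 supports mixed zh/en
        voc += '_csmsc'
    elif lang == 'en':
        am += '_ljspeech'
        voc += '_ljspeech'
    elif lang == 'zh':
        am += '_csmsc'
        voc += '_csmsc'
    elif lang == 'canton':
        am = 'fastspeech2_canton'
        voc = 'pwgan_aishell3'
        spk_id = 10              # the Cantonese model uses speaker 10
    return am, voc, spk_id

print(resolve_models('fastspeech2', 'pwgan', 'zh'))
# ('fastspeech2_csmsc', 'pwgan_csmsc', 174)
```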