# TTS: Giving Digital Humans Real Voice Interaction
## Edge-TTS
Edge-TTS is a Python library that uses Microsoft's Azure Cognitive Services to perform text-to-speech (TTS) conversion.
It provides a simple API for converting text to speech and supports many languages and voices. To use it, first install the library with pip:
```bash
pip install -U edge-tts
```
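Once installed, the library can be used directly. Below is a minimal sketch of basic usage (an internet connection is required); the voice name, text, and output path are just example values:
```python
# Minimal edge-tts sketch; the voice, text, and output path are placeholder values.
import asyncio

import edge_tts


async def demo():
    communicate = edge_tts.Communicate("你好,我是你的数字人。", "zh-CN-XiaoxiaoNeural")
    await communicate.save("hello.mp3")

asyncio.run(demo())
```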
> For more detailed usage, see [https://github.com/rany2/edge-tts](https://github.com/rany2/edge-tts)

Based on the library's source code, I wrote an `EdgeTTS` class that is easier to use and additionally saves a subtitle file, which improves the overall experience.
```python
import asyncio
from io import TextIOWrapper
from typing import TextIO, Union

import edge_tts
from edge_tts import Communicate, SubMaker


def list_voices_fn(proxy=None):
    # Helper (not part of edge-tts itself): run the async edge_tts.list_voices coroutine synchronously.
    return asyncio.run(edge_tts.list_voices(proxy=proxy))


class EdgeTTS:
    def __init__(self, list_voices=False, proxy=None) -> None:
        voices = list_voices_fn(proxy=proxy)
        self.SUPPORTED_VOICE = [item['ShortName'] for item in voices]
        self.SUPPORTED_VOICE.sort(reverse=True)
        if list_voices:
            print(", ".join(self.SUPPORTED_VOICE))

    def preprocess(self, rate, volume, pitch):
        # Convert numeric parameters into the signed string format edge-tts expects.
        if rate >= 0:
            rate = f'+{rate}%'
        else:
            rate = f'{rate}%'
        if pitch >= 0:
            pitch = f'+{pitch}Hz'
        else:
            pitch = f'{pitch}Hz'
        volume = 100 - volume
        volume = f'-{volume}%'
        return rate, volume, pitch

    def predict(self, TEXT, VOICE, RATE, VOLUME, PITCH,
                OUTPUT_FILE='result.wav', OUTPUT_SUBS='result.vtt', words_in_cue=8):
        async def amain() -> None:
            """Stream the synthesis to collect word boundaries for subtitles, then save the audio."""
            rate, volume, pitch = self.preprocess(rate=RATE, volume=VOLUME, pitch=PITCH)
            communicate = Communicate(TEXT, VOICE, rate=rate, volume=volume, pitch=pitch)
            subs: SubMaker = SubMaker()
            sub_file: Union[TextIOWrapper, TextIO] = open(OUTPUT_SUBS, "w", encoding="utf-8")
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    # Audio chunks are ignored here; the audio is written by communicate.save() below.
                    pass
                elif chunk["type"] == "WordBoundary":
                    subs.create_sub((chunk["offset"], chunk["duration"]), chunk["text"])
            sub_file.write(subs.generate_subs(words_in_cue))
            sub_file.close()
            await communicate.save(OUTPUT_FILE)

        asyncio.run(amain())

        # Remove spaces inside subtitle text lines (timestamp lines containing "-->" are left untouched).
        with open(OUTPUT_SUBS, 'r', encoding='utf-8') as file:
            vtt_lines = file.readlines()
        vtt_lines_without_spaces = [line.replace(" ", "") if "-->" not in line else line for line in vtt_lines]
        with open(OUTPUT_SUBS, 'w', encoding='utf-8') as output_file:
            output_file.writelines(vtt_lines_without_spaces)
        return OUTPUT_FILE, OUTPUT_SUBS
```
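A quick, hypothetical example of calling the wrapper (the voice, text, and file names are placeholders):
```python
# Hypothetical usage of the EdgeTTS wrapper above; all argument values are placeholders.
tts = EdgeTTS()
audio_path, subs_path = tts.predict(
    TEXT="大家好,欢迎来到数字人教程。",
    VOICE="zh-CN-XiaoxiaoNeural",
    RATE=0, VOLUME=100, PITCH=0,
    OUTPUT_FILE="answer.wav",
    OUTPUT_SUBS="answer.vtt",
)
print(audio_path, subs_path)
```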
A simple `WebUI` is also provided in the `src` folder; run it with:
```bash
python app.py
```

## PaddleTTS
In practice you may need to run offline. Since Edge TTS requires an online connection to generate speech, we chose the equally open-source PaddleSpeech as an offline text-to-speech (TTS) alternative. The voice quality may differ, but PaddleSpeech works fully offline. For more information, see the PaddleSpeech GitHub page: [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech).
### Vocoders
PaddleSpeech ships with three vocoders: PWGan, WaveRnn, and HifiGan. They differ considerably in audio quality and generation speed, so choose according to your needs. For everyday use we recommend PWGan or HifiGan, because WaveRNN's generation speed is extremely slow.
| Vocoder | Audio quality | Generation speed |
| :-----: | :-----------: | :--------------------: |
| PWGan | Medium | Medium |
| WaveRnn | High | Very slow (be patient) |
| HifiGan | Low | Fast |
### TTS datasets
The examples in PaddleSpeech are organized mainly by dataset. The TTS datasets we mainly use are:
- CSMSC (Mandarin, single speaker)
- AISHELL3 (Mandarin, multi-speaker)
- LJSpeech (English, single speaker)
- VCTK (English, multi-speaker)
### PaddleSpeech TTS model mapping
The TTS models in PaddleSpeech correspond to the following abbreviations (a short usage sketch follows the list):
- tts0 - Tacotron2
- tts1 - TransformerTTS
- tts2 - SpeedySpeech
- tts3 - FastSpeech2
- voc0 - WaveFlow
- voc1 - Parallel WaveGAN
- voc2 - MelGAN
- voc3 - MultiBand MelGAN
- voc4 - Style MelGAN
- voc5 - HiFiGAN
- vc0 - Tacotron2 Voice Clone with GE2E
- vc1 - FastSpeech2 Voice Clone with GE2E
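A full pretrained model name is formed by combining one of these short names with a dataset suffix (for example, fastspeech2 trained on CSMSC becomes fastspeech2_csmsc). As a rough sketch of PaddleSpeech's `TTSExecutor` Python API (the text and output path are placeholders):
```python
# Sketch of PaddleSpeech's Python API; the text and output path are placeholder values.
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
# fastspeech2 (tts3) acoustic model + HiFiGAN (voc5) vocoder, both trained on CSMSC
tts(text="今天天气十分不错。",
    am="fastspeech2_csmsc",
    voc="hifigan_csmsc",
    lang="zh",
    output="demo.wav")
```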
### Pretrained model list
The following pretrained models provided by PaddleSpeech can be used via the command line or the Python API:
#### Acoustic models
| Model | Language |
| :--------------------------- | :----: |
| speedyspeech_csmsc | zh |
| fastspeech2_csmsc | zh |
| fastspeech2_ljspeech | en |
| fastspeech2_aishell3 | zh |
| fastspeech2_vctk | en |
| fastspeech2_cnndecoder_csmsc | zh |
| fastspeech2_mix | mix |
| tacotron2_csmsc | zh |
| tacotron2_ljspeech | en |
| fastspeech2_male | zh |
| fastspeech2_male | en |
| fastspeech2_male | mix |
| fastspeech2_canton | canton |
#### Vocoders
| Model | Language |
| :----------------- | :--: |
| pwgan_csmsc | zh |
| pwgan_ljspeech | en |
| pwgan_aishell3 | zh |
| pwgan_vctk | en |
| mb_melgan_csmsc | zh |
| style_melgan_csmsc | zh |
| hifigan_csmsc | zh |
| hifigan_ljspeech | en |
| hifigan_aishell3 | zh |
| hifigan_vctk | en |
| wavernn_csmsc | zh |
| pwgan_male | zh |
| hifigan_male | zh |
Based on PaddleSpeech, I wrote a `PaddleTTS` class that makes the models easier to use and run:
```python
import os

from paddlespeech.cli.tts.infer import TTSExecutor


class PaddleTTS:
    def __init__(self) -> None:
        pass

    def predict(self, text, am, voc, spk_id=174, lang='zh', male=False, save_path='output.wav'):
        self.tts = TTSExecutor()
        use_onnx = True
        voc = voc.lower()
        am = am.lower()

        # Male voice: only the dedicated *_male models are available.
        if male:
            assert voc in ["pwgan", "hifigan"], "male voc must be 'pwgan' or 'hifigan'"
            wav_file = self.tts(
                text=text,
                output=save_path,
                am='fastspeech2_male',
                voc=voc + '_male',
                lang=lang,
                use_onnx=use_onnx
            )
            return wav_file

        assert am in ['tacotron2', 'fastspeech2'], "am must be 'tacotron2' or 'fastspeech2'"
        # Mixed Chinese/English synthesis: only fastspeech2 is available.
        if lang == 'mix':
            am = 'fastspeech2_mix'
            voc += '_csmsc'
        # English synthesis
        elif lang == 'en':
            am += '_ljspeech'
            voc += '_ljspeech'
        # Chinese synthesis
        elif lang == 'zh':
            assert voc in ['wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'], \
                "voc must be 'wavernn', 'pwgan', 'hifigan', 'style_melgan' or 'mb_melgan'"
            am += '_csmsc'
            voc += '_csmsc'
        # Cantonese synthesis
        elif lang == 'canton':
            am = 'fastspeech2_canton'
            voc = 'pwgan_aishell3'
            spk_id = 10
        print("am:", am, "voc:", voc, "lang:", lang, "male:", male, "spk_id:", spk_id)

        try:
            # Prefer the command-line interface; fall back to the Python API on failure.
            cmd = f'paddlespeech tts --am {am} --voc {voc} --input "{text}" --output {save_path} --lang {lang} --spk_id {spk_id} --use_onnx {use_onnx}'
            os.system(cmd)
            wav_file = save_path
        except Exception:
            wav_file = self.tts(
                text=text,
                output=save_path,
                am=am,
                voc=voc,
                lang=lang,
                spk_id=spk_id,
                use_onnx=use_onnx
            )
        return wav_file
```
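A hypothetical call to the wrapper above (the text and output path are placeholders); the class lowercases the model names and appends the dataset suffix automatically:
```python
# Hypothetical usage of the PaddleTTS wrapper above; all argument values are placeholders.
tts = PaddleTTS()
wav = tts.predict(
    text="欢迎使用 PaddleSpeech 进行语音合成。",
    am="FastSpeech2",
    voc="PWGan",
    lang="zh",
    save_path="paddle_output.wav",
)
print(wav)
```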