# TTS: Giving the Digital Human a Realistic Voice

## Edge-TTS

Edge-TTS is a Python library that performs text-to-speech (TTS) conversion using Microsoft Edge's online text-to-speech service.

The library offers a simple API for converting text to speech and supports many languages and voices. To use it, first install the package with pip:

```bash
pip install -U edge-tts
```

> For more detail on usage, see [https://github.com/rany2/edge-tts](https://github.com/rany2/edge-tts)



Based on its source code, I wrote an `EdgeTTS` class that is easier to work with and additionally saves a subtitle file alongside the audio, which improves the experience:

```python
import asyncio

from edge_tts import Communicate, SubMaker, list_voices


def list_voices_fn(proxy=None):
    """Synchronous wrapper around edge_tts.list_voices (a coroutine)."""
    return asyncio.run(list_voices(proxy=proxy))


class EdgeTTS:
    def __init__(self, list_voices=False, proxy=None) -> None:
        voices = list_voices_fn(proxy=proxy)
        self.SUPPORTED_VOICE = [item['ShortName'] for item in voices]
        self.SUPPORTED_VOICE.sort(reverse=True)
        if list_voices:
            print(", ".join(self.SUPPORTED_VOICE))

    def preprocess(self, rate, volume, pitch):
        # edge-tts expects signed strings such as '+10%' or '-5Hz'
        if rate >= 0:
            rate = f'+{rate}%'
        else:
            rate = f'{rate}%'
        if pitch >= 0:
            pitch = f'+{pitch}Hz'
        else:
            pitch = f'{pitch}Hz'
        # Volume is expressed as a reduction from 100%
        volume = 100 - volume
        volume = f'-{volume}%'
        return rate, volume, pitch

    def predict(self, TEXT, VOICE, RATE, VOLUME, PITCH,
                OUTPUT_FILE='result.wav', OUTPUT_SUBS='result.vtt',
                words_in_cue=8):
        async def amain() -> None:
            """Stream once to collect word boundaries for subtitles,
            then save the audio."""
            rate, volume, pitch = self.preprocess(rate=RATE, volume=VOLUME, pitch=PITCH)
            communicate = Communicate(TEXT, VOICE, rate=rate, volume=volume, pitch=pitch)
            subs: SubMaker = SubMaker()
            with open(OUTPUT_SUBS, "w", encoding="utf-8") as sub_file:
                async for chunk in communicate.stream():
                    if chunk["type"] == "audio":
                        # Audio chunks are ignored here; communicate.save
                        # below writes the audio file.
                        pass
                    elif chunk["type"] == "WordBoundary":
                        subs.create_sub((chunk["offset"], chunk["duration"]), chunk["text"])
                sub_file.write(subs.generate_subs(words_in_cue))
            await communicate.save(OUTPUT_FILE)

        asyncio.run(amain())

        # Strip spaces from subtitle text lines (timestamp lines containing
        # "-->" are left untouched); this reads better for Chinese subtitles.
        with open(OUTPUT_SUBS, 'r', encoding='utf-8') as file:
            vtt_lines = file.readlines()
        vtt_lines_without_spaces = [
            line.replace(" ", "") if "-->" not in line else line
            for line in vtt_lines
        ]
        with open(OUTPUT_SUBS, 'w', encoding='utf-8') as output_file:
            output_file.writelines(vtt_lines_without_spaces)
        return OUTPUT_FILE, OUTPUT_SUBS
```
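Two pieces of the class above are pure string manipulation and easy to check in isolation. The sketch below (standalone; the function names are my own, not part of the class) mirrors the prosody formatting in `preprocess` and the space-stripping applied to the VTT subtitle lines:

```python
def format_prosody(rate: int, volume: int, pitch: int):
    # Mirror EdgeTTS.preprocess: rate/pitch carry an explicit sign,
    # volume is expressed as a reduction from 100%.
    rate_s = f"+{rate}%" if rate >= 0 else f"{rate}%"
    pitch_s = f"+{pitch}Hz" if pitch >= 0 else f"{pitch}Hz"
    volume_s = f"-{100 - volume}%"
    return rate_s, volume_s, pitch_s


def strip_subtitle_spaces(vtt_lines):
    # Remove spaces from text lines only; timestamp lines ("-->") keep theirs.
    return [line if "-->" in line else line.replace(" ", "") for line in vtt_lines]


print(format_prosody(10, 100, -5))
# ('+10%', '-0%', '-5Hz')
print(strip_subtitle_spaces(["00:00:00.000 --> 00:00:01.000", "你 好 世 界"]))
# ['00:00:00.000 --> 00:00:01.000', '你好世界']
```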



A simple `WebUI` is also provided in the `src` folder:

```bash
python app.py
```

![TTS](../docs/TTS.png)

## PaddleTTS

In practice you may need to run fully offline. Since Edge-TTS requires an internet connection to generate speech, we chose the equally open-source PaddleSpeech as an offline text-to-speech (TTS) alternative. The output quality may differ, but PaddleSpeech works offline. For more information, see the PaddleSpeech GitHub page: [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech).

### Vocoder notes

PaddleSpeech ships with three vocoders: PWGan, WaveRnn, and HifiGan. They differ considerably in audio quality and generation speed, so choose according to your needs. We recommend PWGan or HifiGan, since WaveRnn generates extremely slowly.

| Vocoder | Audio quality |    Generation speed    |
| :-----: | :-----------: | :--------------------: |
|  PWGan  |    Medium     |         Medium         |
| WaveRnn |     High      | Very slow (be patient) |
| HifiGan |      Low      |          Fast          |

### TTS datasets

PaddleSpeech's examples are organized mainly by dataset. The TTS datasets we mainly use are:

- CSMSC (Mandarin, single speaker)
- AISHELL3 (Mandarin, multiple speakers)
- LJSpeech (English, single speaker)
- VCTK (English, multiple speakers)

### PaddleSpeech TTS model mapping

PaddleSpeech's TTS shorthand names map to the following models:

- tts0 - Tacotron2
- tts1 - TransformerTTS
- tts2 - SpeedySpeech
- tts3 - FastSpeech2
- voc0 - WaveFlow
- voc1 - Parallel WaveGAN
- voc2 - MelGAN
- voc3 - MultiBand MelGAN
- voc4 - Style MelGAN
- voc5 - HiFiGAN
- vc0 - Tacotron2 Voice Clone with GE2E
- vc1 - FastSpeech2 Voice Clone with GE2E

### Pretrained models

The following pretrained models, provided by PaddleSpeech, are available from both the command line and the Python API:

#### Acoustic models

| Model                        | Language |
| :--------------------------- | :------: |
| speedyspeech_csmsc           |   zh   |
| fastspeech2_csmsc            |   zh   |
| fastspeech2_ljspeech         |   en   |
| fastspeech2_aishell3         |   zh   |
| fastspeech2_vctk             |   en   |
| fastspeech2_cnndecoder_csmsc |   zh   |
| fastspeech2_mix              |  mix   |
| tacotron2_csmsc              |   zh   |
| tacotron2_ljspeech           |   en   |
| fastspeech2_male             |   zh   |
| fastspeech2_male             |   en   |
| fastspeech2_male             |  mix   |
| fastspeech2_canton           | canton |

#### Vocoders

| Model              | Language |
| :----------------- | :------: |
| pwgan_csmsc        |  zh  |
| pwgan_ljspeech     |  en  |
| pwgan_aishell3     |  zh  |
| pwgan_vctk         |  en  |
| mb_melgan_csmsc    |  zh  |
| style_melgan_csmsc |  zh  |
| hifigan_csmsc      |  zh  |
| hifigan_ljspeech   |  en  |
| hifigan_aishell3   |  zh  |
| hifigan_vctk       |  en  |
| wavernn_csmsc      |  zh  |
| pwgan_male         |  zh  |
| hifigan_male       |  zh  |

Building on PaddleSpeech, I wrote a `PaddleTTS` class that makes it easier to use and run:

```python
import os

from paddlespeech.cli.tts.infer import TTSExecutor


class PaddleTTS:
    def __init__(self) -> None:
        pass

    def predict(self, text, am, voc, spk_id=174, lang='zh', male=False, save_path='output.wav'):
        self.tts = TTSExecutor()

        use_onnx = True
        voc = voc.lower()
        am = am.lower()

        if male:
            assert voc in ["pwgan", "hifigan"], "male voc must be 'pwgan' or 'hifigan'"
            wav_file = self.tts(
                text=text,
                output=save_path,
                am='fastspeech2_male',
                voc=voc + '_male',
                lang=lang,
                use_onnx=use_onnx
            )
            return wav_file

        assert am in ['tacotron2', 'fastspeech2'], "am must be 'tacotron2' or 'fastspeech2'"

        # Mixed Chinese-English synthesis
        if lang == 'mix':
            # Only fastspeech2 supports mixed-language input
            am = 'fastspeech2_mix'
            voc += '_csmsc'
        # English synthesis
        elif lang == 'en':
            am += '_ljspeech'
            voc += '_ljspeech'
        # Chinese synthesis
        elif lang == 'zh':
            assert voc in ['wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'], \
                "voc must be one of 'wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'"
            am += '_csmsc'
            voc += '_csmsc'
        elif lang == 'canton':
            am = 'fastspeech2_canton'
            voc = 'pwgan_aishell3'
            spk_id = 10
        print("am:", am, "voc:", voc, "lang:", lang, "male:", male, "spk_id:", spk_id)

        # Try the CLI first. os.system does not raise on failure, so check
        # its return code and fall back to the Python API when it is non-zero.
        cmd = (f'paddlespeech tts --am {am} --voc {voc} --input "{text}" '
               f'--output {save_path} --lang {lang} --spk_id {spk_id} --use_onnx {use_onnx}')
        if os.system(cmd) == 0:
            wav_file = save_path
        else:
            wav_file = self.tts(
                text=text,
                output=save_path,
                am=am,
                voc=voc,
                lang=lang,
                spk_id=spk_id,
                use_onnx=use_onnx
            )
        return wav_file
```
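The model-name resolution inside `predict` is a pure function of `(am, voc, lang, male)` and can be sketched on its own. The helper below (the name `resolve_models` is my own, for illustration) mirrors the branches above, which makes the suffix logic easy to verify without installing PaddleSpeech:

```python
def resolve_models(am: str, voc: str, lang: str, male: bool = False, spk_id: int = 174):
    """Map generic model/vocoder names plus a language to the concrete
    PaddleSpeech pretrained-model identifiers (mirrors PaddleTTS.predict)."""
    am, voc = am.lower(), voc.lower()
    if male:
        return "fastspeech2_male", voc + "_male", spk_id
    if lang == "mix":
        # Only fastspeech2 supports mixed-language input
        return "fastspeech2_mix", voc + "_csmsc", spk_id
    if lang == "en":
        return am + "_ljspeech", voc + "_ljspeech", spk_id
    if lang == "zh":
        return am + "_csmsc", voc + "_csmsc", spk_id
    if lang == "canton":
        # Cantonese uses a fixed model pair and speaker id
        return "fastspeech2_canton", "pwgan_aishell3", 10
    raise ValueError(f"unsupported lang: {lang!r}")


print(resolve_models("FastSpeech2", "HifiGan", "zh"))
# ('fastspeech2_csmsc', 'hifigan_csmsc', 174)
```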