Korean FastSpeech 2 - Pytorch Implementation

Introduction

FastSpeech 2 is a model that improves on the slow training and synthesis speed of earlier autoregressive models. It is a non-autoregressive model whose Variance Adaptor uses variance information to make speech prediction more accurate. In other words, where earlier models predicted speech from audio-text pairs alone, FastSpeech 2 adds pitch, energy, and duration. In FastSpeech 2, duration is extracted with MFA (Montreal Forced Aligner), and the alignment between phonemes and audio is built from the extracted durations.

Install Dependencies

python=3.9, pytorch=1.13, ffmpeg, g2pk

sudo apt update
sudo apt install ffmpeg
pip install g2pk
pip install -r requirements.txt

Preprocessing

Step 1

MFA (Montreal Forced Aligner) is used to extract the durations that FastSpeech 2 training requires. MFA aligns each utterance (audio file) with its phoneme sequence and saves the result as a TextGrid file.

  1. Create wav-lab pairs

Each wav file needs a lab file that contains the transcript of its utterance.

The function for this step reads the wav files and text from the metadata and creates a transcript file (.lab) for each wav file, with the same name and only the extension changed.

[Screenshot 1: wav-lab pairs]

When this step finishes, wav-lab pairs like the ones shown above should exist.
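
The pairing step can be sketched roughly as follows. This is a minimal illustration, not the notebook's own function: the name write_lab_files, the "|" separator default, and the wav_idx/text_idx column positions are assumptions to be adjusted to your metadata file.

```python
from pathlib import Path


def write_lab_files(metadata_path, wav_dir, wav_idx=0, text_idx=1, sep="|"):
    """Write one .lab transcript next to each wav listed in the metadata.

    wav_idx and text_idx are assumptions: set them to the columns that hold
    the wav name and the transcript in your own metadata file.
    """
    wav_dir = Path(wav_dir)
    written = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split(sep)
            wav_path = wav_dir / parts[wav_idx]
            # Same name as the wav file; only the extension differs.
            lab_path = wav_path.with_suffix(".lab")
            lab_path.parent.mkdir(parents=True, exist_ok=True)
            lab_path.write_text(parts[text_idx].strip() + "\n", encoding="utf-8")
            written.append(lab_path)
    return written
```

Calling write_lab_files("transcript.txt", "wavs/") with your own paths returns the list of .lab paths it wrote, which makes it easy to check that every wav has a partner.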

  2. Create the lexicon file

Create a lexicon file that records the phonemes for every utterance in your dataset.

Run the make_p_dict and make_lexicon functions in the processing_utils.ipynb notebook, in that order.

[Screenshot: example p_lexicon.txt]

When this step finishes, a p_lexicon.txt file of the form shown above is created.
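
To make the lexicon format concrete, here is a rough sketch of building "word, then space-separated phones" lines. It is not the notebook's make_p_dict or make_lexicon: the function names are hypothetical, and it simply splits each Hangul syllable into its jamo by Unicode arithmetic, skipping the pronunciation rules that a g2p tool such as g2pk would apply first.

```python
# Jamo tables for precomposed Hangul syllables (U+AC00..U+D7A3).
LEADS = [chr(0x1100 + i) for i in range(19)]
VOWELS = [chr(0x1161 + i) for i in range(21)]
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]


def to_jamo(word):
    """Split each Hangul syllable into lead, vowel and optional tail jamo."""
    phones = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            phones.append(LEADS[lead])
            phones.append(VOWELS[vowel])
            if tail:
                phones.append(TAILS[tail])
        # Non-Hangul characters (punctuation, digits) are skipped here.
    return phones


def make_lexicon_lines(transcripts):
    """Return sorted 'word<TAB>p1 p2 ...' lines, one per unique word."""
    entries = {}
    for text in transcripts:
        for word in text.split():
            phones = to_jamo(word)
            if phones:
                entries[word] = " ".join(phones)
    return [f"{w}\t{p}" for w, p in sorted(entries.items())]
```

Writing the returned lines to a text file gives one entry per unique word. Note that this sketch keys entries on the raw whitespace-separated token, so punctuation attached to a word stays in the key; strip it beforehand if your transcripts contain any.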

  3. Install MFA
  4. Run MFA

MFA provides a pre-trained Korean acoustic model and g2p model. However, those models produce English phonemes, so to produce Korean phonemes you have to train a model yourself.

Once MFA is installed, run the command below.

mfa train <dataset location> <p_lexicon location> <out directory>

If MFA runs successfully, TextGrid files of the following form are created.

[Screenshot: example TextGrid file]
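
The TextGrid intervals are what later become per-phoneme durations. As a minimal sketch of that conversion, assuming the intervals have already been read into (phone, start, end) tuples in seconds: the function name is hypothetical, and the sampling_rate and hop_length defaults are assumed values that should match your own mel-spectrogram settings.

```python
def intervals_to_durations(intervals, sampling_rate=22050, hop_length=256):
    """Convert (phone, start_sec, end_sec) intervals into per-phone frame counts.

    sampling_rate and hop_length are assumed values; use the ones from your
    own mel-spectrogram settings.
    """
    phones, durations = [], []
    for phone, start, end in intervals:
        # Round both boundaries to frame indices so that consecutive
        # intervals share a boundary and no frames are lost or counted twice.
        start_frame = int(round(start * sampling_rate / hop_length))
        end_frame = int(round(end * sampling_rate / hop_length))
        phones.append(phone)
        durations.append(end_frame - start_frame)
    return phones, durations
```

Rounding the boundaries rather than the interval lengths keeps the total number of frames equal to the frame index of the last boundary, so the durations stay in step with the mel-spectrogram length.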

(3) Data preprocessing

  1. hparams.py

  • dataset : name of the dataset folder
  • data_path : parent folder of dataset
  • meta_name : file name of the metadata, e.g. transcript.v.1.4.txt
  • textgrid_path : location of the TextGrid archive (compress the TextGrid files beforehand)
  • textgrid_name : file name of the TextGrid archive
  2. preprocess.py

[Screenshot: the part of preprocess.py to change]

Change this part to match the name of your dataset.

  3. data/kss.py
  • line 19 : basename,text = parts[?],parts[?] # positions of the wav name and the text in the metadata when each line is split on "|"
  • line 37 : basename,text = parts[?],parts[?]

Once all of the changes above are done, run the command below.

python preprocess.py
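
For the parts[?] edits in data/kss.py, a quick way to pick the right indices is to split one metadata line by hand and look at what each position holds. The helper below is a hypothetical sketch for that check; whether data/kss.py strips the directory and extension from the wav field in exactly this way is an assumption.

```python
import os


def parse_metadata_line(line, wav_idx, text_idx, sep="|"):
    """Return (basename, text) for one metadata line.

    wav_idx and text_idx stand in for the parts[?] indices; choose them
    by looking at one line of your own metadata file.
    """
    parts = line.strip().split(sep)
    # Drop any directory prefix and the file extension from the wav field.
    basename = os.path.splitext(os.path.basename(parts[wav_idx]))[0]
    text = parts[text_idx]
    return basename, text
```

Try it on the first line of your metadata with a few index pairs until basename matches your wav file names and text holds the transcript, then use those two indices on lines 19 and 37.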

Train

Before training, download the VocGAN (neural vocoder) pretrained on the kss dataset and place it in vocoder/pretrained_models/.

Then run the command below to train the model.

python train.py

Trained models are saved in ckpt/ and TensorBoard logs in log/. Audio generated during evaluation while training is saved in the eval/ folder.

Synthesis

The command for generating speech from the trained parameters is:

python synthesis.py --step 500000

The synthesized audio can be found in the results/ directory.

Pretrained model

Download the pretrained model (checkpoint), then place it at the path given by the checkpoint_path variable in hparams.py to use it.

Fine-Tuning

When fine-tuning from the pretrained model, at least 30 minutes of data is recommended. In an experiment with about 10 minutes of data, the voice and pronunciation were broadly similar to the target, but the output was very noisy.

Fine-tuning requires adjusting the learning rate. It needs to be suitably low, and the right value has to be found empirically. (I used the learning rate from the final training step.)

python train.py --restore_step 350000 
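
To get a feel for what "the learning rate from the final training step" might be, the sketch below evaluates a warmup plus inverse-square-root schedule at a given step. The schedule itself is an assumption (this README does not state which one the training code uses), and d_model and warmup_steps are assumed values; check your own optimizer settings before relying on the numbers.

```python
def scheduled_lr(step, d_model=256, warmup_steps=4000):
    """Learning rate at a given step under a warmup + inverse-sqrt schedule.

    d_model and warmup_steps are assumed values, and the schedule itself is
    an assumption about the optimizer; check your own training code.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    init_lr = d_model ** -0.5
    # Linear warmup until warmup_steps, then decay proportional to step**-0.5.
    return init_lr * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under these assumptions, scheduled_lr(350000) is roughly a tenth of the peak value at step 4000, which gives a starting point for the "suitably low" learning rate to try when fine-tuning.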

Tensorboard

tensorboard --logdir log/hp.dataset/

TensorBoard logs are saved in the log/hp.dataset/ directory (hp.dataset being the dataset value set in hparams.py), so you can run TensorBoard with the command above to monitor training progress.

References