Spaces:

JacobLinCool
/

ZeroRVC

Paused

github-actions[bot] committed on Jan 24

Commit

2d9b22b

0 Parent(s):

Sync from https://github.com/JacobLinCool/zero-rvc

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

.gitattributes +36 -0
.github/workflows/sync.yml +26 -0
.gitignore +6 -0
LICENSE +19 -0
README.md +57 -0
app.py +49 -0
app/__init__.py +0 -0
app/constants.py +13 -0
app/dataset.py +225 -0
app/dataset_maker.py +225 -0
app/infer.py +164 -0
app/model.py +17 -0
app/settings.py +26 -0
app/train.py +169 -0
app/tutorial.py +30 -0
app/zero.py +24 -0
example-dataset.py +9 -0
example-infer.py +15 -0
example-train.py +38 -0
headers.yaml +8 -0
my-voices/.gitignore +1 -0
pyproject.toml +37 -0
requirements.txt +7 -0
zerorvc/__init__.py +8 -0
zerorvc/assets/mute/mute48k.wav +3 -0
zerorvc/auto_loader.py +1 -0
zerorvc/constants.py +7 -0
zerorvc/dataset.py +253 -0
zerorvc/f0/__init__.py +3 -0
zerorvc/f0/extractor.py +65 -0
zerorvc/f0/load.py +27 -0
zerorvc/f0/rmvpe/__init__.py +6 -0
zerorvc/f0/rmvpe/constants.py +8 -0
zerorvc/f0/rmvpe/deepunet.py +227 -0
zerorvc/f0/rmvpe/mel.py +68 -0
zerorvc/f0/rmvpe/model.py +118 -0
zerorvc/f0/rmvpe/seq.py +18 -0
zerorvc/f0/rmvpe/stft.py +119 -0
zerorvc/hubert/__init__.py +2 -0
zerorvc/hubert/extractor.py +40 -0
zerorvc/hubert/load.py +28 -0
zerorvc/preprocess/__init__.py +2 -0
zerorvc/preprocess/crop.py +16 -0
zerorvc/preprocess/preprocess.py +54 -0
zerorvc/preprocess/slicer2.py +147 -0
zerorvc/pretrained.py +14 -0
zerorvc/rvc.py +366 -0
zerorvc/synthesizer/__init__.py +1 -0
zerorvc/synthesizer/attentions.py +493 -0
zerorvc/synthesizer/commons.py +172 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text

.github/workflows/sync.yml ADDED

	@@ -0,0 +1,26 @@

+name: Sync to Hugging Face Spaces
+on:
+    push:
+        branches:
+            - main
+jobs:
+    sync:
+        name: Sync
+        runs-on: ubuntu-latest
+        steps:
+            - name: Checkout Repository
+              uses: actions/checkout@v4
+              with:
+                  lfs: true
+            - name: Sync to Hugging Face Spaces
+              uses: JacobLinCool/huggingface-sync@v1
+              with:
+                  github: ${{ secrets.GITHUB_TOKEN }}
+                  user: jacoblincool # Hugging Face username or organization name
+                  space: ZeroRVC # Hugging Face space name
+                  token: ${{ secrets.HF_TOKEN }} # Hugging Face token
+                  configuration: headers.yaml

.gitignore ADDED

	@@ -0,0 +1,6 @@

+.DS_Store
+*.pyc
+__pycache__
+dist/
+logs/
+separated/

LICENSE ADDED

	@@ -0,0 +1,19 @@

+Copyright (c) 2024 Jacob Lin <jacob@csie.cool>
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,57 @@

+---
+title: ZeroRVC
+emoji: 🎙️
+colorFrom: gray
+colorTo: gray
+sdk: gradio
+sdk_version: 4.37.2
+app_file: app.py
+pinned: false
+---
+# ZeroRVC
+Run Retrieval-based Voice Conversion training and inference with ease.
+## Features
+- [x] Dataset Preparation
+- [x] Hugging Face Datasets Integration
+- [x] Hugging Face Accelerate Integration
+- [x] Trainer API
+- [x] Inference API
+  - [ ] Index Support
+- [x] Tensorboard Support
+- [ ] FP16 Support
+## Dataset Preparation
+ZeroRVC provides a simple API to prepare your dataset for training. You only need to provide the path to your audio files. The feature extraction models will be downloaded automatically, or you can provide your own with the `hubert` and `rmvpe` arguments.
+```py
+from datasets import load_dataset
+from zerorvc import prepare, RVCTrainer
+dataset = load_dataset("my-audio-dataset")
+dataset = prepare(dataset)
+trainer = RVCTrainer(
+    "my-rvc-model",
+    dataset_train=dataset["train"],
+    dataset_test=dataset["test"],
+)
+trainer.train(epochs=100, batch_size=8, upload="someone/rvc-test-1")
+```
+## Inference
+ZeroRVC provides an easy API to convert your voice with the trained model.
+```py
+from zerorvc import RVC
+import soundfile as sf
+rvc = RVC.from_pretrained("someone/rvc-test-1")
+samples = rvc.convert("test.mp3")
+sf.write("output.wav", samples, rvc.sr)
+```

app.py ADDED

	@@ -0,0 +1,49 @@

+import gradio as gr
+from app.settings import SettingsTab
+from app.tutorial import TutotialTab
+from app.dataset import DatasetTab
+from app.train import TrainTab
+from app.infer import InferenceTab
+from app.zero import zero_is_available
+if zero_is_available:
+    import torch
+    torch.backends.cuda.matmul.allow_tf32 = True
+with gr.Blocks() as app:
+    gr.Markdown("# ZeroRVC")
+    gr.Markdown(
+        "Run Retrieval-based Voice Conversion training and inference on Hugging Face ZeroGPU or locally."
+    )
+    settings = SettingsTab()
+    tutorial = TutotialTab()
+    dataset = DatasetTab()
+    training = TrainTab()
+    inference = InferenceTab()
+    with gr.Accordion(label="Environment Settings"):
+        settings.ui()
+    with gr.Tabs():
+        with gr.Tab(label="Tutorial", id=0):
+            tutorial.ui()
+        with gr.Tab(label="Dataset", id=1):
+            dataset.ui()
+        with gr.Tab(label="Training", id=2):
+            training.ui()
+        with gr.Tab(label="Inference", id=3):
+            inference.ui()
+    settings.build()
+    tutorial.build()
+    dataset.build(settings.exp_dir, settings.hf_token)
+    training.build(settings.exp_dir, settings.hf_token)
+    inference.build(settings.exp_dir)
+    app.launch()

app/__init__.py ADDED

File without changes

app/constants.py ADDED

	@@ -0,0 +1,13 @@

+import os
+from pathlib import Path
+HF_TOKEN = os.environ.get("HF_TOKEN")
+ROOT_EXP_DIR = Path(
+    os.environ.get("ROOT_EXP_DIR")
+    or os.path.join(os.path.dirname(os.path.abspath(__file__)), "../logs")
+).resolve()
+ROOT_EXP_DIR.mkdir(exist_ok=True, parents=True)
+BATCH_SIZE = int(os.environ.get("BATCH_SIZE") or 8)
+TRAINING_EPOCHS = int(os.environ.get("TRAINING_EPOCHS") or 10)
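
Note on `app/constants.py`: the `int(os.environ.get(...) or default)` pattern above falls back to the default both when the variable is unset and when it is set to an empty string. A minimal sketch of that behaviour follows; `env_int` and its `env` parameter are hypothetical names introduced only for illustration and are not part of the repository.

```python
import os


def env_int(name: str, default: int, env=None) -> int:
    """Hypothetical helper mirroring `int(os.environ.get(name) or default)`."""
    env = os.environ if env is None else env
    # `or` picks the default when the lookup yields None (unset) or "" (empty).
    return int(env.get(name) or default)


print(env_int("BATCH_SIZE", 8, env={}))                    # variable unset
print(env_int("BATCH_SIZE", 8, env={"BATCH_SIZE": ""}))    # variable empty
print(env_int("BATCH_SIZE", 8, env={"BATCH_SIZE": "16"}))  # explicit value
```

A non-numeric value such as `BATCH_SIZE=big` still raises `ValueError` from `int`, so a misconfigured environment fails at import time.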

app/dataset.py ADDED

	@@ -0,0 +1,225 @@

+import os
+import gradio as gr
+import zipfile
+import tempfile
+from zerorvc import prepare
+from datasets import load_dataset, load_from_disk
+from .constants import ROOT_EXP_DIR, BATCH_SIZE
+from .zero import zero
+from .model import accelerator
+def extract_audio_files(zip_file: str, target_dir: str) -> list[str]:
+    with zipfile.ZipFile(zip_file, "r") as zip_ref:
+        zip_ref.extractall(target_dir)
+    audio_files = [
+        os.path.join(target_dir, f)
+        for f in os.listdir(target_dir)
+        if f.endswith((".wav", ".mp3", ".ogg"))
+    ]
+    if not audio_files:
+        raise gr.Error("No audio files found at the top level of the zip file")
+    return audio_files
+def make_dataset_from_zip(exp_dir: str, zip_file: str):
+    if not exp_dir:
+        exp_dir = tempfile.mkdtemp(dir=ROOT_EXP_DIR)
+        print(f"Using exp dir: {exp_dir}")
+    data_dir = os.path.join(exp_dir, "raw_data")
+    if not os.path.exists(data_dir):
+        os.makedirs(data_dir)
+    extract_audio_files(zip_file, data_dir)
+    ds = prepare(
+        data_dir,
+        accelerator=accelerator,
+        batch_size=BATCH_SIZE,
+        stage=1,
+    )
+    return exp_dir, str(ds)
+@zero(duration=120)
+def make_dataset_from_zip_stage_2(exp_dir: str):
+    data_dir = os.path.join(exp_dir, "raw_data")
+    ds = prepare(
+        data_dir,
+        accelerator=accelerator,
+        batch_size=BATCH_SIZE,
+        stage=2,
+    )
+    return exp_dir, str(ds)
+def make_dataset_from_zip_stage_3(exp_dir: str):
+    data_dir = os.path.join(exp_dir, "raw_data")
+    ds = prepare(
+        data_dir,
+        accelerator=accelerator,
+        batch_size=BATCH_SIZE,
+        stage=3,
+    )
+    dataset = os.path.join(exp_dir, "dataset")
+    ds.save_to_disk(dataset)
+    return exp_dir, str(ds)
+def make_dataset_from_repo(repo: str, hf_token: str):
+    ds = load_dataset(repo, token=hf_token)
+    ds = prepare(
+        ds,
+        accelerator=accelerator,
+        batch_size=BATCH_SIZE,
+        stage=1,
+    )
+    return str(ds)
+@zero(duration=120)
+def make_dataset_from_repo_stage_2(repo: str, hf_token: str):
+    ds = load_dataset(repo, token=hf_token)
+    ds = prepare(
+        ds,
+        accelerator=accelerator,
+        batch_size=BATCH_SIZE,
+        stage=2,
+    )
+    return str(ds)
+def make_dataset_from_repo_stage_3(exp_dir: str, repo: str, hf_token: str):
+    ds = load_dataset(repo, token=hf_token)
+    ds = prepare(
+        ds,
+        accelerator=accelerator,
+        batch_size=BATCH_SIZE,
+        stage=3,
+    )
+    if not exp_dir:
+        exp_dir = tempfile.mkdtemp(dir=ROOT_EXP_DIR)
+        print(f"Using exp dir: {exp_dir}")
+    dataset = os.path.join(exp_dir, "dataset")
+    ds.save_to_disk(dataset)
+    return exp_dir, str(ds)
+def use_dataset(exp_dir: str, repo: str, hf_token: str):
+    gr.Info("Fetching dataset")
+    ds = load_dataset(repo, token=hf_token)
+    if not exp_dir:
+        exp_dir = tempfile.mkdtemp(dir=ROOT_EXP_DIR)
+        print(f"Using exp dir: {exp_dir}")
+    dataset = os.path.join(exp_dir, "dataset")
+    ds.save_to_disk(dataset)
+    return exp_dir, str(ds)
+def upload_dataset(exp_dir: str, repo: str, hf_token: str):
+    dataset = os.path.join(exp_dir, "dataset")
+    if not os.path.exists(dataset):
+        raise gr.Error("Dataset not found")
+    gr.Info("Uploading dataset")
+    ds = load_from_disk(dataset)
+    ds.push_to_hub(repo, token=hf_token, private=True)
+    gr.Info("Dataset uploaded successfully")
+class DatasetTab:
+    def __init__(self):
+        pass
+    def ui(self):
+        gr.Markdown("# Dataset")
+        gr.Markdown("The suggested dataset size is > 5 minutes of audio.")
+        gr.Markdown("## Create Dataset from ZIP")
+        gr.Markdown(
+            "Create a dataset by uploading a zip file containing audio files. The audio files should be at the top level of the zip file."
+        )
+        with gr.Row():
+            self.zip_file = gr.File(
+                label="Upload a zip file containing audio files",
+                file_types=["zip"],
+            )
+            self.make_ds_from_dir = gr.Button(
+                value="Create Dataset from ZIP", variant="primary"
+            )
+        gr.Markdown("## Create Dataset from Dataset Repository")
+        gr.Markdown(
+            "You can also create a dataset from any Hugging Face dataset repository that has an 'audio' column."
+        )
+        with gr.Row():
+            self.repo = gr.Textbox(
+                label="Hugging Face Dataset Repository",
+                placeholder="username/dataset-name",
+            )
+            self.make_ds_from_repo = gr.Button(
+                value="Create Dataset from Repo", variant="primary"
+            )
+        gr.Markdown("## Sync Preprocessed Dataset")
+        gr.Markdown(
+            "After preprocessing the dataset, you can upload it to Hugging Face and fetch it back directly later."
+        )
+        with gr.Row():
+            self.preprocessed_repo = gr.Textbox(
+                label="Hugging Face Dataset Repository",
+                placeholder="username/dataset-name",
+            )
+            self.fetch_ds = gr.Button(value="Fetch Dataset", variant="primary")
+            self.upload_ds = gr.Button(value="Upload Dataset", variant="primary")
+        self.ds_state = gr.Textbox(label="Dataset Info", lines=5)
+    def build(self, exp_dir: gr.Textbox, hf_token: gr.Textbox):
+        self.make_ds_from_dir.click(
+            fn=make_dataset_from_zip,
+            inputs=[exp_dir, self.zip_file],
+            outputs=[exp_dir, self.ds_state],
+        ).success(
+            fn=make_dataset_from_zip_stage_2,
+            inputs=[exp_dir],
+            outputs=[exp_dir, self.ds_state],
+        ).success(
+            fn=make_dataset_from_zip_stage_3,
+            inputs=[exp_dir],
+            outputs=[exp_dir, self.ds_state],
+        )
+        self.make_ds_from_repo.click(
+            fn=make_dataset_from_repo,
+            inputs=[self.repo, hf_token],
+            outputs=[self.ds_state],
+        ).success(
+            fn=make_dataset_from_repo_stage_2,
+            inputs=[self.repo, hf_token],
+            outputs=[self.ds_state],
+        ).success(
+            fn=make_dataset_from_repo_stage_3,
+            inputs=[exp_dir, self.repo, hf_token],
+            outputs=[exp_dir, self.ds_state],
+        )
+        self.fetch_ds.click(
+            fn=use_dataset,
+            inputs=[exp_dir, self.preprocessed_repo, hf_token],
+            outputs=[exp_dir, self.ds_state],
+        )
+        self.upload_ds.click(
+            fn=upload_dataset,
+            inputs=[exp_dir, self.preprocessed_repo, hf_token],
+            outputs=[],
+        )
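
Note on `app/dataset.py`: `extract_audio_files` lists only the top level of the extraction directory, which is why the UI text asks for audio files at the top level of the zip. The sketch below reproduces that filter with the standard library only; `top_level_audio` and the archive contents are hypothetical and exist just to show the effect.

```python
import os
import tempfile
import zipfile

AUDIO_EXTS = (".wav", ".mp3", ".ogg")


def top_level_audio(zip_path: str, target_dir: str) -> list[str]:
    """Extract the archive, then keep only audio files directly in target_dir."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(target_dir)
    # os.listdir does not recurse, so files inside subfolders are skipped.
    return sorted(f for f in os.listdir(target_dir) if f.endswith(AUDIO_EXTS))


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "voices.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("a.wav", b"")         # top level: kept
        zf.writestr("nested/b.wav", b"")  # inside a folder: ignored
        zf.writestr("notes.txt", b"")     # not audio: ignored
    out_dir = os.path.join(tmp, "raw_data")
    os.makedirs(out_dir)
    found = top_level_audio(zip_path, out_dir)

print(found)
```

Because `str.endswith` is case-sensitive, a file named `A.WAV` would also be skipped by this filter.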

app/dataset_maker.py ADDED

	@@ -0,0 +1,225 @@

+import yt_dlp
+import numpy as np
+import librosa
+import soundfile as sf
+import os
+import zipfile
+# Function to download audio from YouTube and save it as a WAV file
+def download_youtube_audio(url, audio_name):
+    ydl_opts = {
+        "format": "bestaudio/best",
+        "postprocessors": [
+            {
+                "key": "FFmpegExtractAudio",
+                "preferredcodec": "wav",
+            }
+        ],
+        "outtmpl": f"youtubeaudio/{audio_name}",  # Output template
+    }
+    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
+        ydl.download([url])
+    return f"youtubeaudio/{audio_name}.wav"
+# Function to calculate RMS
+def get_rms(y, frame_length=2048, hop_length=512, pad_mode="constant"):
+    padding = (int(frame_length // 2), int(frame_length // 2))
+    y = np.pad(y, padding, mode=pad_mode)
+    axis = -1
+    out_strides = y.strides + tuple([y.strides[axis]])
+    x_shape_trimmed = list(y.shape)
+    x_shape_trimmed[axis] -= frame_length - 1
+    out_shape = tuple(x_shape_trimmed) + tuple([frame_length])
+    xw = np.lib.stride_tricks.as_strided(y, shape=out_shape, strides=out_strides)
+    if axis < 0:
+        target_axis = axis - 1
+    else:
+        target_axis = axis + 1
+    xw = np.moveaxis(xw, -1, target_axis)
+    slices = [slice(None)] * xw.ndim
+    slices[axis] = slice(0, None, hop_length)
+    x = xw[tuple(slices)]
+    power = np.mean(np.abs(x) ** 2, axis=-2, keepdims=True)
+    return np.sqrt(power)
+# Slicer class
+class Slicer:
+    def __init__(
+        self,
+        sr,
+        threshold=-40.0,
+        min_length=5000,
+        min_interval=300,
+        hop_size=20,
+        max_sil_kept=5000,
+    ):
+        if not min_length >= min_interval >= hop_size:
+            raise ValueError(
+                "The following condition must be satisfied: min_length >= min_interval >= hop_size"
+            )
+        if not max_sil_kept >= hop_size:
+            raise ValueError(
+                "The following condition must be satisfied: max_sil_kept >= hop_size"
+            )
+        min_interval = sr * min_interval / 1000
+        self.threshold = 10 ** (threshold / 20.0)
+        self.hop_size = round(sr * hop_size / 1000)
+        self.win_size = min(round(min_interval), 4 * self.hop_size)
+        self.min_length = round(sr * min_length / 1000 / self.hop_size)
+        self.min_interval = round(min_interval / self.hop_size)
+        self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)
+    def _apply_slice(self, waveform, begin, end):
+        if len(waveform.shape) > 1:
+            return waveform[
+                :, begin * self.hop_size : min(waveform.shape[1], end * self.hop_size)
+            ]
+        else:
+            return waveform[
+                begin * self.hop_size : min(waveform.shape[0], end * self.hop_size)
+            ]
+    def slice(self, waveform):
+        if len(waveform.shape) > 1:
+            samples = waveform.mean(axis=0)
+        else:
+            samples = waveform
+        if samples.shape[0] <= self.min_length:
+            return [waveform]
+        rms_list = get_rms(
+            y=samples, frame_length=self.win_size, hop_length=self.hop_size
+        ).squeeze(0)
+        sil_tags = []
+        silence_start = None
+        clip_start = 0
+        for i, rms in enumerate(rms_list):
+            if rms < self.threshold:
+                if silence_start is None:
+                    silence_start = i
+                continue
+            if silence_start is None:
+                continue
+            is_leading_silence = silence_start == 0 and i > self.max_sil_kept
+            need_slice_middle = (
+                i - silence_start >= self.min_interval
+                and i - clip_start >= self.min_length
+            )
+            if not is_leading_silence and not need_slice_middle:
+                silence_start = None
+                continue
+            if i - silence_start <= self.max_sil_kept:
+                pos = rms_list[silence_start : i + 1].argmin() + silence_start
+                if silence_start == 0:
+                    sil_tags.append((0, pos))
+                else:
+                    sil_tags.append((pos, pos))
+                clip_start = pos
+            elif i - silence_start <= self.max_sil_kept * 2:
+                pos = rms_list[
+                    i - self.max_sil_kept : silence_start + self.max_sil_kept + 1
+                ].argmin()
+                pos += i - self.max_sil_kept
+                pos_l = (
+                    rms_list[
+                        silence_start : silence_start + self.max_sil_kept + 1
+                    ].argmin()
+                    + silence_start
+                )
+                pos_r = (
+                    rms_list[i - self.max_sil_kept : i + 1].argmin()
+                    + i
+                    - self.max_sil_kept
+                )
+                if silence_start == 0:
+                    sil_tags.append((0, pos_r))
+                    clip_start = pos_r
+                else:
+                    sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
+                    clip_start = max(pos_r, pos)
+            else:
+                pos_l = (
+                    rms_list[
+                        silence_start : silence_start + self.max_sil_kept + 1
+                    ].argmin()
+                    + silence_start
+                )
+                pos_r = (
+                    rms_list[i - self.max_sil_kept : i + 1].argmin()
+                    + i
+                    - self.max_sil_kept
+                )
+                if silence_start == 0:
+                    sil_tags.append((0, pos_r))
+                else:
+                    sil_tags.append((pos_l, pos_r))
+                clip_start = pos_r
+            silence_start = None
+        total_frames = rms_list.shape[0]
+        if (
+            silence_start is not None
+            and total_frames - silence_start >= self.min_interval
+        ):
+            silence_end = min(total_frames, silence_start + self.max_sil_kept)
+            pos = rms_list[silence_start : silence_end + 1].argmin() + silence_start
+            sil_tags.append((pos, total_frames + 1))
+        if len(sil_tags) == 0:
+            return [waveform]
+        else:
+            chunks = []
+            if sil_tags[0][0] > 0:
+                chunks.append(self._apply_slice(waveform, 0, sil_tags[0][0]))
+            for i in range(len(sil_tags) - 1):
+                chunks.append(
+                    self._apply_slice(waveform, sil_tags[i][1], sil_tags[i + 1][0])
+                )
+            if sil_tags[-1][1] < total_frames:
+                chunks.append(
+                    self._apply_slice(waveform, sil_tags[-1][1], total_frames)
+                )
+            return chunks
+# Function to slice and save audio chunks
+def slice_audio(file_path, audio_name):
+    audio, sr = librosa.load(file_path, sr=None, mono=False)
+    os.makedirs(f"dataset/{audio_name}", exist_ok=True)
+    slicer = Slicer(
+        sr=sr,
+        threshold=-40,
+        min_length=5000,
+        min_interval=500,
+        hop_size=10,
+        max_sil_kept=500,
+    )
+    chunks = slicer.slice(audio)
+    for i, chunk in enumerate(chunks):
+        if len(chunk.shape) > 1:
+            chunk = chunk.T
+        sf.write(f"dataset/{audio_name}/split_{i}.wav", chunk, sr)
+    return f"dataset/{audio_name}"
+# Function to zip the dataset directory
+def zip_directory(directory_path, audio_name):
+    zip_file = f"dataset/{audio_name}.zip"
+    os.makedirs(os.path.dirname(zip_file), exist_ok=True)  # Ensure the directory exists
+    with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_DEFLATED) as zipf:
+        for root, dirs, files in os.walk(directory_path):
+            for file in files:
+                file_path = os.path.join(root, file)
+                arcname = os.path.relpath(file_path, start=directory_path)
+                zipf.write(file_path, arcname)
+    return zip_file
+# Gradio interface
+def process_audio(url, audio_name):
+    file_path = download_youtube_audio(url, audio_name)
+    dataset_path = slice_audio(file_path, audio_name)
+    zip_file = zip_directory(dataset_path, audio_name)
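+    # print() returns None, so build the status message first and return it
+    message = f"{zip_file} successfully processed"
+    print(message)
+    return zip_file, message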
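
Note on `app/dataset_maker.py`: `Slicer.__init__` takes its threshold in dB and its durations in milliseconds, then converts them to a linear amplitude and to frame counts. The sketch below repeats that arithmetic outside the class so the resulting numbers can be inspected; `slicer_units` is a hypothetical helper, and the 44.1 kHz sample rate is an assumption chosen for the example.

```python
import math


def slicer_units(sr, threshold=-40.0, min_length=5000, min_interval=300,
                 hop_size=20, max_sil_kept=5000):
    """Hypothetical helper repeating the unit conversions in Slicer.__init__."""
    min_interval_samples = sr * min_interval / 1000  # ms -> samples
    hop = round(sr * hop_size / 1000)                # ms -> samples per frame
    return {
        "threshold": 10 ** (threshold / 20.0),       # dB -> linear amplitude
        "hop_size": hop,
        "win_size": min(round(min_interval_samples), 4 * hop),
        "min_length": round(sr * min_length / 1000 / hop),      # frames
        "min_interval": round(min_interval_samples / hop),      # frames
        "max_sil_kept": round(sr * max_sil_kept / 1000 / hop),  # frames
    }


# Same arguments that slice_audio passes, with an assumed 44.1 kHz input.
units = slicer_units(sr=44100, threshold=-40, min_length=5000,
                     min_interval=500, hop_size=10, max_sil_kept=500)
for key, value in units.items():
    print(key, value)
```

With these settings a frame covers 10 ms, so a 5000 ms minimum clip length becomes 500 frames and the -40 dB threshold corresponds to an RMS amplitude of about 0.01.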

app/infer.py ADDED

	@@ -0,0 +1,164 @@

+import os
+import shutil
+import hashlib
+from pathlib import Path
+from typing import Tuple
+from demucs.separate import main as demucs
+import gradio as gr
+import numpy as np
+import soundfile as sf
+from zerorvc import RVC
+from .zero import zero
+from .model import device
+import yt_dlp
+def download_audio(url):
+    ydl_opts = {
+        "format": "bestaudio/best",
+        "outtmpl": "ytdl/%(title)s.%(ext)s",
+        "postprocessors": [
+            {
+                "key": "FFmpegExtractAudio",
+                "preferredcodec": "wav",
+                "preferredquality": "192",
+            }
+        ],
+    }
+    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
+        info_dict = ydl.extract_info(url, download=True)
+        file_path = ydl.prepare_filename(info_dict).rsplit(".", 1)[0] + ".wav"
+        # Load the extracted WAV as int16 samples; sf.read returns (data, samplerate)
+        audio_array, sample_rate = sf.read(file_path, dtype="int16")
+        return sample_rate, audio_array
+@zero(duration=120)
+def infer(
+    exp_dir: str, original_audio: str, pitch_mod: int, protect: float
+) -> str:
+    checkpoint_dir = os.path.join(exp_dir, "checkpoints")
+    if not os.path.exists(checkpoint_dir):
+        raise gr.Error("Model not found")
+    # rename the original audio to the hash
+    with open(original_audio, "rb") as f:
+        original_audio_hash = hashlib.md5(f.read()).hexdigest()
+    ext = Path(original_audio).suffix
+    original_audio_hashed = os.path.join(exp_dir, f"{original_audio_hash}{ext}")
+    shutil.copy(original_audio, original_audio_hashed)
+    out = os.path.join("separated", "htdemucs", original_audio_hash, "vocals.wav")
+    if not os.path.exists(out):
+        demucs(
+            [
+                "--two-stems",
+                "vocals",
+                "-d",
+                str(device),
+                "-n",
+                "htdemucs",
+                original_audio_hashed,
+            ]
+        )
+    rvc = RVC.from_pretrained(checkpoint_dir)
+    samples = rvc.convert(out, pitch_modification=pitch_mod, protect=protect)
+    file = os.path.join(exp_dir, "infer.wav")
+    sf.write(file, samples, rvc.sr)
+    return file
+def merge(exp_dir: str, original_audio: str, vocal: Tuple[int, np.ndarray]) -> str:
+    with open(original_audio, "rb") as f:
+        original_audio_hash = hashlib.md5(f.read()).hexdigest()
+    music = os.path.join("separated", "htdemucs", original_audio_hash, "no_vocals.wav")
+    tmp = os.path.join(exp_dir, "tmp.wav")
+    sf.write(tmp, vocal[1], vocal[0])
+    os.system(
+        f"ffmpeg -i {music} -i {tmp} -filter_complex '[1]volume=2[a];[0][a]amix=inputs=2:duration=first:dropout_transition=2' -ac 2 -y {tmp}.merged.mp3"
+    )
+    return f"{tmp}.merged.mp3"
+class InferenceTab:
+    def __init__(self):
+        pass
+    def ui(self):
+        gr.Markdown("# Inference")
+        gr.Markdown(
+            "After the trained model is pruned, you can use it to run inference on new music. \n"
+            "Upload the original audio and adjust the F0 add value to generate the inferred audio."
+        )
+        with gr.Row():
+            self.original_audio = gr.Audio(
+                label="Upload original audio",
+                type="filepath",
+                show_download_button=True,
+            )
+            with gr.Accordion("Inference by Link", open=False):
+                with gr.Row():
+                    youtube_link = gr.Textbox(
+                        label="Link",
+                        placeholder="Paste the link here",
+                        interactive=True,
+                    )
+                with gr.Row():
+                    gr.Markdown(
+                        "You can paste the link to the video/audio from many sites, check the complete list [here](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)"
+                    )
+                with gr.Row():
+                    download_button = gr.Button("Download!", variant="primary")
+                    download_button.click(
+                        download_audio, [youtube_link], [self.original_audio]
+                    )
+            with gr.Column():
+                self.pitch_mod = gr.Slider(
+                    label="Pitch Modification +/-",
+                    minimum=-16,
+                    maximum=16,
+                    step=1,
+                    value=0,
+                )
+                self.protect = gr.Slider(
+                    label="Protect",
+                    minimum=0,
+                    maximum=0.5,
+                    step=0.01,
+                    value=0.33,
+                )
+            self.infer_btn = gr.Button(value="Infer", variant="primary")
+        with gr.Row():
+            self.infer_output = gr.Audio(
+                label="Inferred audio", show_download_button=True, format="mp3"
+            )
+        with gr.Row():
+            self.merge_output = gr.Audio(
+                label="Merged audio", show_download_button=True, format="mp3"
+            )
+    def build(self, exp_dir: gr.Textbox):
+        self.infer_btn.click(
+            fn=infer,
+            inputs=[
+                exp_dir,
+                self.original_audio,
+                self.pitch_mod,
+                self.protect,
+            ],
+            outputs=[self.infer_output],
+        ).success(
+            fn=merge,
+            inputs=[exp_dir, self.original_audio, self.infer_output],
+            outputs=[self.merge_output],
+        )
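
Note on `app/infer.py`: both `infer` and `merge` hash the uploaded file with MD5 and use the digest as the folder name under `separated/htdemucs`, so the same upload appears to reuse an earlier Demucs separation instead of running it again. A minimal sketch of that path derivation, working on bytes rather than files; `separated_stem_path` is a hypothetical name.

```python
import hashlib
import os


def separated_stem_path(audio_bytes: bytes, stem: str = "vocals") -> str:
    """Hypothetical helper: content hash -> expected Demucs output path."""
    digest = hashlib.md5(audio_bytes).hexdigest()
    return os.path.join("separated", "htdemucs", digest, f"{stem}.wav")


a = separated_stem_path(b"same audio")
b = separated_stem_path(b"same audio")
c = separated_stem_path(b"other audio")

print(a == b)  # identical content maps to the same cache path
print(a == c)  # different content maps to a different path
print(separated_stem_path(b"same audio", stem="no_vocals"))
```

MD5 serves here only as a content fingerprint for cache lookup, not as a security measure.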

app/model.py ADDED

	@@ -0,0 +1,17 @@

+import logging
+from accelerate import Accelerator
+from zerorvc import load_hubert, load_rmvpe
+logger = logging.getLogger(__name__)
+accelerator = Accelerator()
+device = accelerator.device
+logger.info(f"device: {device}")
+logger.info(f"mixed_precision: {accelerator.mixed_precision}")
+rmvpe = load_rmvpe(device=device)
+logger.info("RMVPE model loaded.")
+hubert = load_hubert(device=device)
+logger.info("HuBERT model loaded.")

app/settings.py ADDED

	@@ -0,0 +1,26 @@

+import gradio as gr
+from .constants import HF_TOKEN
+class SettingsTab:
+    def __init__(self):
+        pass
+    def ui(self):
+        self.exp_dir = gr.Textbox(
+            label="Temporary Experiment Directory (auto-managed)",
+            placeholder="It will be auto-generated after setup",
+            interactive=True,
+        )
+        gr.Markdown(
+            "### Sync with Hugging Face 🤗\n\nThe access token will be used to upload/download the dataset and model."
+        )
+        self.hf_token = gr.Textbox(
+            label="Hugging Face Access Token",
+            placeholder="Paste your Hugging Face access token here (hf_...)",
+            value=HF_TOKEN,
+            interactive=True,
+        )
+    def build(self):
+        pass

app/train.py ADDED

	@@ -0,0 +1,169 @@

+import os
+import tempfile
+import gradio as gr
+import torch
+from zerorvc import RVCTrainer, pretrained_checkpoints, SynthesizerTrnMs768NSFsid
+from zerorvc.trainer import TrainingCheckpoint
+from datasets import load_from_disk
+from huggingface_hub import snapshot_download
+from .zero import zero
+from .model import accelerator, device
+from .constants import BATCH_SIZE, ROOT_EXP_DIR, TRAINING_EPOCHS
+@zero(duration=240)
+def train_model(exp_dir: str, progress=gr.Progress()):
+    dataset = os.path.join(exp_dir, "dataset")
+    if not os.path.exists(dataset):
+        raise gr.Error("Dataset not found. Please prepare the dataset first.")
+    ds = load_from_disk(dataset)
+    checkpoint_dir = os.path.join(exp_dir, "checkpoints")
+    trainer = RVCTrainer(checkpoint_dir)
+    resume_from = trainer.latest_checkpoint()
+    if resume_from is None:
+        resume_from = pretrained_checkpoints()
+        gr.Info("Starting training from pretrained checkpoints.")
+    else:
+        gr.Info(f"Resuming training from {resume_from}")
+    tqdm = progress.tqdm(
+        trainer.train(
+            dataset=ds["train"],
+            resume_from=resume_from,
+            batch_size=BATCH_SIZE,
+            epochs=TRAINING_EPOCHS,
+            accelerator=accelerator,
+        ),
+        total=TRAINING_EPOCHS,
+        unit="epochs",
+        desc="Training",
+    )
+    latest: TrainingCheckpoint | None = None
+    for ckpt in tqdm:
+        info = f"Epoch: {ckpt.epoch} loss: (gen: {ckpt.loss_gen:.4f}, fm: {ckpt.loss_fm:.4f}, mel: {ckpt.loss_mel:.4f}, kl: {ckpt.loss_kl:.4f}, disc: {ckpt.loss_disc:.4f})"
+        print(info)
+        latest = ckpt
+    if latest is None:
+        raise gr.Error("Training produced no checkpoints.")
+    latest.save(trainer.checkpoint_dir)
+    latest.G.save_pretrained(trainer.checkpoint_dir)
+    result = f"{TRAINING_EPOCHS} epochs trained. Latest loss: (gen: {latest.loss_gen:.4f}, fm: {latest.loss_fm:.4f}, mel: {latest.loss_mel:.4f}, kl: {latest.loss_kl:.4f}, disc: {latest.loss_disc:.4f})"
+    del trainer
+    if device.type == "cuda":
+        torch.cuda.empty_cache()
+    return result
+def upload_model(exp_dir: str, repo: str, hf_token: str):
+    checkpoint_dir = os.path.join(exp_dir, "checkpoints")
+    if not os.path.exists(checkpoint_dir):
+        raise gr.Error("Model not found")
+    gr.Info("Uploading model")
+    model = SynthesizerTrnMs768NSFsid.from_pretrained(checkpoint_dir)
+    model.push_to_hub(repo, token=hf_token, private=True)
+    gr.Info("Model uploaded successfully")
+def upload_checkpoints(exp_dir: str, repo: str, hf_token: str):
+    checkpoint_dir = os.path.join(exp_dir, "checkpoints")
+    if not os.path.exists(checkpoint_dir):
+        raise gr.Error("Checkpoints not found")
+    gr.Info("Uploading checkpoints")
+    trainer = RVCTrainer(checkpoint_dir)
+    trainer.push_to_hub(repo, token=hf_token, private=True)
+    gr.Info("Checkpoints uploaded successfully")
+def fetch_model(exp_dir: str, repo: str, hf_token: str):
+    if not exp_dir:
+        exp_dir = tempfile.mkdtemp(dir=ROOT_EXP_DIR)
+    checkpoint_dir = os.path.join(exp_dir, "checkpoints")
+    gr.Info("Fetching model")
+    files = ["README.md", "config.json", "model.safetensors"]
+    snapshot_download(
+        repo, token=hf_token, local_dir=checkpoint_dir, allow_patterns=files
+    )
+    gr.Info("Model fetched successfully")
+    return exp_dir
+def fetch_checkpoints(exp_dir: str, repo: str, hf_token: str):
+    if not exp_dir:
+        exp_dir = tempfile.mkdtemp(dir=ROOT_EXP_DIR)
+    checkpoint_dir = os.path.join(exp_dir, "checkpoints")
+    gr.Info("Fetching checkpoints")
+    snapshot_download(repo, token=hf_token, local_dir=checkpoint_dir)
+    gr.Info("Checkpoints fetched successfully")
+    return exp_dir
+class TrainTab:
+    def __init__(self):
+        pass
+    def ui(self):
+        gr.Markdown("# Training")
+        gr.Markdown(
+            "You can start training the model by clicking the button below. "
+            f"Each time you click the button, the model will train for {TRAINING_EPOCHS} epochs, which takes about 3 minutes on ZeroGPU (A100). "
+        )
+        with gr.Row():
+            self.train_btn = gr.Button(value="Train", variant="primary")
+            self.result = gr.Textbox(label="Training Result", lines=3)
+        gr.Markdown("## Sync Model and Checkpoints with Hugging Face")
+        gr.Markdown(
+            "You can upload the trained model and checkpoints to Hugging Face for sharing or further training."
+        )
+        self.repo = gr.Textbox(label="Repository ID", placeholder="username/repo")
+        with gr.Row():
+            self.upload_model_btn = gr.Button(value="Upload Model", variant="primary")
+            self.upload_checkpoints_btn = gr.Button(
+                value="Upload Checkpoints", variant="primary"
+            )
+        with gr.Row():
+            self.fetch_mode_btn = gr.Button(value="Fetch Model", variant="primary")
+            self.fetch_checkpoints_btn = gr.Button(
+                value="Fetch Checkpoints", variant="primary"
+            )
+    def build(self, exp_dir: gr.Textbox, hf_token: gr.Textbox):
+        self.train_btn.click(
+            fn=train_model,
+            inputs=[exp_dir],
+            outputs=[self.result],
+        )
+        self.upload_model_btn.click(
+            fn=upload_model,
+            inputs=[exp_dir, self.repo, hf_token],
+        )
+        self.upload_checkpoints_btn.click(
+            fn=upload_checkpoints,
+            inputs=[exp_dir, self.repo, hf_token],
+        )
+        self.fetch_mode_btn.click(
+            fn=fetch_model,
+            inputs=[exp_dir, self.repo, hf_token],
+            outputs=[exp_dir],
+        )
+        self.fetch_checkpoints_btn.click(
+            fn=fetch_checkpoints,
+            inputs=[exp_dir, self.repo, hf_token],
+            outputs=[exp_dir],
+        )
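
Note on the training loop in `train_model`: it consumes a generator that yields one checkpoint per epoch and keeps the last one for saving. The sketch below illustrates that pattern in isolation. `FakeCheckpoint`, `fake_train`, and `run` are hypothetical stand-ins, not part of the `zerorvc` API, and the guard for an empty generator is an assumption about how one might handle a run that yields nothing.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class FakeCheckpoint:
    """Hypothetical stand-in for a per-epoch training checkpoint."""

    epoch: int
    loss: float


def fake_train(epochs: int) -> Iterator[FakeCheckpoint]:
    """Yield one checkpoint per epoch, like a generator-style trainer."""
    for epoch in range(1, epochs + 1):
        yield FakeCheckpoint(epoch=epoch, loss=1.0 / epoch)


def run(epochs: int) -> Optional[FakeCheckpoint]:
    """Consume the generator and return the last checkpoint, or None."""
    latest: Optional[FakeCheckpoint] = None
    for ckpt in fake_train(epochs):
        print(f"Epoch: {ckpt.epoch} loss: {ckpt.loss:.4f}")
        latest = ckpt
    return latest
```

Returning `None` when no epoch ran lets the caller decide whether that is an error, instead of failing on an unbound variable.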

app/tutorial.py ADDED Viewed

	@@ -0,0 +1,30 @@

+import gradio as gr
+class TutotialTab:
+    def __init__(self):
+        pass
+    def ui(self):
+        gr.Markdown(
+            """
+            # Welcome to ZeroRVC!
+            > If you prefer working in Python, you can also [use the Python API to run ZeroRVC](https://pypi.org/project/zerorvc/).
+            ZeroRVC is a toolkit for training and inference of retrieval-based voice conversion models.
+            By leveraging the power of Hugging Face ZeroGPU, you can train your model in minutes without setting up the environment.
+            ## How to Use
+            There are 3 main steps to use ZeroRVC:
+            - **Make Dataset**: Prepare your dataset for training. You can upload a zip file containing audio files.
+            - **Model Training**: Train your model using the prepared dataset.
+            - **Model Inference**: Try your model.
+            """
+        )
+    def build(self):
+        pass

app/zero.py ADDED Viewed

	@@ -0,0 +1,24 @@

+import os
+import logging
+logger = logging.getLogger(__name__)
+zero_is_available = "SPACES_ZERO_GPU" in os.environ
+if zero_is_available:
+    import spaces  # type: ignore
+    logger.info("ZeroGPU is available")
+else:
+    logger.info("ZeroGPU is not available")
+# a decorator that applies the spaces.GPU decorator if zero is available
+def zero(duration=60):
+    def wrapper(func):
+        if zero_is_available:
+            return spaces.GPU(func, duration=duration)
+        else:
+            return func
+    return wrapper
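
Note on `app/zero.py`: the `zero` helper is a decorator factory that wraps a function only when an environment flag is present, and otherwise returns it unchanged. A minimal sketch of that pattern follows. `fake_gpu` is a hypothetical stand-in for a GPU-allocating wrapper, and the `available` parameter is an assumption added here so the behavior can be exercised without the real environment.

```python
import functools
import os


def fake_gpu(func, duration=60):
    """Hypothetical stand-in for a GPU-allocating wrapper."""

    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        # A real wrapper would reserve a GPU for up to `duration` seconds here.
        return func(*args, **kwargs)

    wrapped.gpu_duration = duration
    return wrapped


def conditional_gpu(duration=60, available=None):
    """Apply fake_gpu only when the flag is set; otherwise pass through."""
    if available is None:
        available = "SPACES_ZERO_GPU" in os.environ

    def wrapper(func):
        return fake_gpu(func, duration=duration) if available else func

    return wrapper


@conditional_gpu(duration=240, available=False)
def add(a, b):
    return a + b


@conditional_gpu(duration=240, available=True)
def mul(a, b):
    return a * b
```

Because the undecorated path returns the original function object, local runs without the flag carry no wrapper overhead.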

example-dataset.py ADDED Viewed

	@@ -0,0 +1,9 @@

+import os
+from zerorvc import prepare
+HF_TOKEN = os.environ.get("HF_TOKEN")
+dataset = prepare("./my-voices")
+print(dataset)
+dataset.push_to_hub("my-rvc-dataset", token=HF_TOKEN, private=True)

example-infer.py ADDED Viewed

	@@ -0,0 +1,15 @@

+import os
+from zerorvc import RVC
+import soundfile as sf
+HF_TOKEN = os.environ.get("HF_TOKEN")
+MODEL = "JacobLinCool/my-rvc-model3"
+rvc = RVC.from_pretrained(MODEL, token=HF_TOKEN)
+samples = rvc.convert("test.mp3")
+sf.write("output.wav", samples, rvc.sr)
+pitch_modifications = [-12, -8, -4, 4, 8, 12]
+for pitch_modification in pitch_modifications:
+    samples = rvc.convert("test.mp3", pitch_modification=pitch_modification)
+    sf.write(f"output-{pitch_modification}.wav", samples, rvc.sr)

example-train.py ADDED Viewed

	@@ -0,0 +1,38 @@

+import os
+from datasets import load_dataset
+from tqdm import tqdm
+from zerorvc import RVCTrainer, pretrained_checkpoints
+HF_TOKEN = os.environ.get("HF_TOKEN")
+EPOCHS = 100
+BATCH_SIZE = 8
+DATASET = "JacobLinCool/my-rvc-dataset"
+MODEL = "JacobLinCool/my-rvc-model"
+dataset = load_dataset(DATASET, token=HF_TOKEN)
+print(dataset)
+trainer = RVCTrainer(checkpoint_dir="./checkpoints")
+training = tqdm(
+    trainer.train(
+        dataset=dataset["train"],
+        resume_from=pretrained_checkpoints(),  # resume training from the pretrained VCTK checkpoint
+        epochs=EPOCHS,
+        batch_size=BATCH_SIZE,
+    ),
+    total=EPOCHS,
+)
+# Training loop: iterate over epochs
+for checkpoint in training:
+    training.set_description(
+        f"Epoch {checkpoint.epoch}/{EPOCHS} loss: (gen: {checkpoint.loss_gen:.4f}, fm: {checkpoint.loss_fm:.4f}, mel: {checkpoint.loss_mel:.4f}, kl: {checkpoint.loss_kl:.4f}, disc: {checkpoint.loss_disc:.4f})"
+    )
+    # Save checkpoint every 10 epochs
+    if checkpoint.epoch % 10 == 0:
+        checkpoint.save(checkpoint_dir=trainer.checkpoint_dir)
+        # Directly push the synthesizer to the Hugging Face Hub
+        checkpoint.G.push_to_hub(MODEL, token=HF_TOKEN, private=True)
+print("Training completed.")

headers.yaml ADDED Viewed

	@@ -0,0 +1,8 @@

+title: ZeroRVC
+emoji: 🎙️
+colorFrom: gray
+colorTo: gray
+sdk: gradio
+sdk_version: 4.37.2
+app_file: app.py
+pinned: false

my-voices/.gitignore ADDED Viewed

	@@ -0,0 +1 @@


+*.wav

pyproject.toml ADDED Viewed

	@@ -0,0 +1,37 @@

+[project]
+name = "zerorvc"
+version = "0.0.19"
+authors = [{ name = "Jacob Lin", email = "jacob@csie.cool" }]
+description = "Run Retrieval-based Voice Conversion training and inference with ease."
+readme = "README.md"
+requires-python = ">=3.8"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+dependencies = [
+    "numpy>=1.0.0",
+    "torch>=2.0.0",
+    "datasets",
+    "accelerate",
+    "transformers",
+    "huggingface_hub",
+    "tqdm",
+    "librosa",
+    "scipy",
+    "tensorboard",
+]
+[project.urls]
+Homepage = "https://github.com/jacoblincool/zero-rvc"
+Issues = "https://github.com/jacoblincool/zero-rvc/issues"
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.sdist]
+include = ["zerorvc/**/*", "pyproject.toml", "README.md", "LICENSE"]
+[tool.hatch.build.targets.wheel]
+packages = ["zerorvc"]

requirements.txt ADDED Viewed

	@@ -0,0 +1,7 @@

+zerorvc>=0.0.10
+# gradio app deps
+gradio
+demucs==4.0.1
+yt_dlp
+tensorboard

zerorvc/__init__.py ADDED Viewed

	@@ -0,0 +1,8 @@

+from .rvc import RVC
+from .trainer import RVCTrainer
+from .dataset import prepare
+from .synthesizer import SynthesizerTrnMs768NSFsid
+from .pretrained import pretrained_checkpoints
+from .f0 import load_rmvpe, RMVPE, F0Extractor
+from .hubert import load_hubert, HubertModel, HubertFeatureExtractor
+from .auto_loader import auto_loaded_model

zerorvc/assets/mute/mute48k.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2f2bb4daaa106e351aebb001e5a25de985c0b472f22e8d60676bc924a79056ee
+size 288078

zerorvc/auto_loader.py ADDED Viewed

	@@ -0,0 +1 @@


+auto_loaded_model = {}

zerorvc/constants.py ADDED Viewed

	@@ -0,0 +1,7 @@

+SR_16K = 16000
+SR_48K = 48000
+N_FFT = 2048
+HOP_LENGTH = 480
+WIN_LENGTH = 2048
+N_MELS = 128

zerorvc/dataset.py ADDED Viewed

	@@ -0,0 +1,253 @@

+import os
+import numpy as np
+import torch
+import librosa
+import logging
+import shutil
+from pkg_resources import resource_filename
+from accelerate import Accelerator
+from datasets import load_dataset, DatasetDict, Dataset, Audio
+from .preprocess import Preprocessor, crop_feats_length
+from .hubert import HubertFeatureExtractor, HubertModel, load_hubert
+from .f0 import F0Extractor, RMVPE, load_rmvpe
+from .constants import *
+logger = logging.getLogger(__name__)
+def extract_hubert_features(
+    rows,
+    hfe: HubertFeatureExtractor,
+    hubert: str | HubertModel | None,
+    device: torch.device,
+):
+    if not hfe.is_loaded():
+        model = load_hubert(hubert, device)
+        hfe.load(model)
+    feats = []
+    for row in rows["wav_16k"]:
+        feat = hfe.extract_feature_from(row["array"].astype("float32"))
+        feats.append(feat)
+    return {"hubert_feats": feats}
+def extract_f0_features(
+    rows, f0e: F0Extractor, rmvpe: str | RMVPE | None, device: torch.device
+):
+    if not f0e.is_loaded():
+        model = load_rmvpe(rmvpe, device)
+        f0e.load(model)
+    f0s = []
+    f0nsfs = []
+    for row in rows["wav_16k"]:
+        f0nsf, f0 = f0e.extract_f0_from(row["array"].astype("float32"))
+        f0s.append(f0)
+        f0nsfs.append(f0nsf)
+    return {"f0": f0s, "f0nsf": f0nsfs}
+def feature_postprocess(rows):
+    phones = rows["hubert_feats"]
+    for i, phone in enumerate(phones):
+        phone = np.repeat(phone, 2, axis=0)
+        n_num = min(phone.shape[0], 900)
+        phone = phone[:n_num, :]
+        phones[i] = phone
+        if "f0" in rows:
+            pitch = rows["f0"][i]
+            pitch = pitch[:n_num]
+            pitch = np.array(pitch, dtype=np.float32)
+            rows["f0"][i] = pitch
+        if "f0nsf" in rows:
+            pitchf = rows["f0nsf"][i]
+            pitchf = pitchf[:n_num]
+            rows["f0nsf"][i] = pitchf
+    return rows
+def calculate_spectrogram(
+    rows, n_fft=N_FFT, hop_length=HOP_LENGTH, win_length=WIN_LENGTH
+):
+    specs = []
+    hann_window = np.hanning(win_length)
+    pad_amount = int((win_length - hop_length) / 2)
+    for row in rows["wav_gt"]:
+        stft = librosa.stft(
+            np.pad(row["array"], (pad_amount, pad_amount), mode="reflect"),
+            n_fft=n_fft,
+            hop_length=hop_length,
+            win_length=win_length,
+            window=hann_window,
+            center=False,
+        )
+        specs.append(np.abs(stft) + 1e-6)
+    return {"spec": specs}
+def fix_length(rows, hop_length=HOP_LENGTH):
+    for i, row in enumerate(rows["spec"]):
+        spec = np.array(row)
+        phone = np.array(rows["hubert_feats"][i])
+        pitch = np.array(rows["f0"][i])
+        pitchf = np.array(rows["f0nsf"][i])
+        wav_gt = np.array(rows["wav_gt"][i]["array"])
+        spec, phone, pitch, pitchf = crop_feats_length(spec, phone, pitch, pitchf)
+        phone_len = phone.shape[0]
+        wav_gt = wav_gt[: phone_len * hop_length]
+        rows["hubert_feats"][i] = phone
+        rows["f0"][i] = pitch
+        rows["f0nsf"][i] = pitchf
+        rows["spec"][i] = spec
+        rows["wav_gt"][i]["array"] = wav_gt
+    return rows
+def prepare(
+    dir: str | DatasetDict,
+    sr=SR_48K,
+    hubert: str | HubertModel | None = None,
+    rmvpe: str | RMVPE | None = None,
+    batch_size=1,
+    max_slice_length: float | None = 3.0,
+    accelerator: Accelerator = None,
+    include_mute=True,
+    stage=3,
+):
+    """
+    Prepare the dataset for training or evaluation.
+    Args:
+        dir (str | DatasetDict): The directory path or DatasetDict object containing the dataset.
+        sr (int, optional): The target sampling rate. Defaults to SR_48K.
+        hubert (str | HubertModel | None, optional): The Hubert model or its name to use for feature extraction. Defaults to None.
+        rmvpe (str | RMVPE | None, optional): The RMVPE model or its name to use for feature extraction. Defaults to None.
+        batch_size (int, optional): The batch size for processing the dataset. Defaults to 1.
+        max_slice_length (float | None, optional): The maximum slice length passed to the Preprocessor. If None, audio is not sliced. Defaults to 3.0.
+        accelerator (Accelerator, optional): The accelerator object for distributed training. Defaults to None.
+        include_mute (bool, optional): Whether to include a mute audio file in the directory dataset. Defaults to True.
+        stage (int, optional): The dataset preparation level to perform. Defaults to 3. (Stage 1 and 3 are CPU intensive, Stage 2 is GPU intensive.)
+    Returns:
+        DatasetDict: The prepared dataset.
+    """
+    if accelerator is None:
+        accelerator = Accelerator()
+    if isinstance(dir, (DatasetDict, Dataset)):
+        ds = dir
+    else:
+        mute_source = resource_filename("zerorvc", "assets/mute/mute48k.wav")
+        mute_dest = os.path.join(dir, "mute.wav")
+        if include_mute and not os.path.exists(mute_dest):
+            logger.info(f"Copying {mute_source} to {mute_dest}")
+            shutil.copy(mute_source, mute_dest)
+        ds: DatasetDict | Dataset = load_dataset("audiofolder", data_dir=dir)
+    for key in ds:
+        ds[key] = ds[key].remove_columns(
+            [col for col in ds[key].column_names if col != "audio"]
+        )
+    ds = ds.cast_column("audio", Audio(sampling_rate=sr))
+    if stage <= 0:
+        return ds
+    # Stage 1, CPU intensive
+    pp = Preprocessor(sr, max_slice_length) if max_slice_length is not None else None
+    def preprocess(rows):
+        wav_gt = []
+        wav_16k = []
+        for row in rows["audio"]:
+            if pp is not None:
+                slices = pp.preprocess_audio(row["array"])
+                for slice in slices:
+                    wav_gt.append({"path": "", "array": slice, "sampling_rate": sr})
+                    slice16k = librosa.resample(slice, orig_sr=sr, target_sr=SR_16K)
+                    wav_16k.append(
+                        {"path": "", "array": slice16k, "sampling_rate": SR_16K}
+                    )
+            else:
+                slice = row["array"]
+                wav_gt.append({"path": "", "array": slice, "sampling_rate": sr})
+                slice16k = librosa.resample(slice, orig_sr=sr, target_sr=SR_16K)
+                wav_16k.append({"path": "", "array": slice16k, "sampling_rate": SR_16K})
+        return {"wav_gt": wav_gt, "wav_16k": wav_16k}
+    ds = ds.map(
+        preprocess, batched=True, batch_size=batch_size, remove_columns=["audio"]
+    )
+    ds = ds.cast_column("wav_gt", Audio(sampling_rate=sr))
+    ds = ds.cast_column("wav_16k", Audio(sampling_rate=SR_16K))
+    if stage <= 1:
+        return ds
+    # Stage 2, GPU intensive
+    hfe = HubertFeatureExtractor()
+    ds = ds.map(
+        extract_hubert_features,
+        batched=True,
+        batch_size=batch_size,
+        fn_kwargs={"hfe": hfe, "hubert": hubert, "device": accelerator.device},
+    )
+    f0e = F0Extractor()
+    ds = ds.map(
+        extract_f0_features,
+        batched=True,
+        batch_size=batch_size,
+        fn_kwargs={"f0e": f0e, "rmvpe": rmvpe, "device": accelerator.device},
+    )
+    if stage <= 2:
+        return ds
+    # Stage 3, CPU intensive
+    ds = ds.map(feature_postprocess, batched=True, batch_size=batch_size)
+    ds = ds.map(calculate_spectrogram, batched=True, batch_size=batch_size)
+    ds = ds.map(fix_length, batched=True, batch_size=batch_size)
+    return ds
+def show_dataset_pitch_distribution(dataset):
+    import matplotlib.pyplot as plt
+    import seaborn as sns
+    import numpy as np
+    sns.set_theme()
+    pitches = []
+    for row in dataset["f0"]:
+        pitches.extend([p for p in row if p != 1])
+    pitches = np.array(pitches)
+    stats = {
+        "mean": np.mean(pitches),
+        "std": np.std(pitches),
+        "min": np.min(pitches),
+        "max": np.max(pitches),
+        "median": np.median(pitches),
+        "q1": np.percentile(pitches, 25),
+        "q3": np.percentile(pitches, 75),
+    }
+    plt.figure(figsize=(10, 6))
+    sns.histplot(pitches, bins=100)
+    plt.title(
+        f"Pitch Distribution\nMean: {stats['mean']:.1f} ± {stats['std']:.1f}\n"
+        f"Range: [{stats['min']:.1f}, {stats['max']:.1f}]\n"
+        f"Quartiles: [{stats['q1']:.1f}, {stats['median']:.1f}, {stats['q3']:.1f}]"
+    )
+    plt.xlabel("Frequency (Note)")
+    plt.ylabel("Count")
+    plt.show()
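
Note on frame alignment in `zerorvc/dataset.py`: `calculate_spectrogram` reflect-pads each 48 kHz slice by `(win_length - hop_length) / 2` samples per side and runs a non-centered STFT, while `feature_postprocess` repeats the 16 kHz features twice. The arithmetic below sketches why both end up near 100 frames per second. The constants are copied from `zerorvc/constants.py`; the feature hop of 320 samples at 16 kHz is an assumption about the HuBERT extractor, not something shown in this diff.

```python
SR_48K = 48000
SR_16K = 16000
N_FFT = 2048
HOP_LENGTH = 480
WIN_LENGTH = 2048


def stft_frames(num_samples: int) -> int:
    """Frame count of a non-centered STFT after reflect-padding both sides."""
    pad_amount = (WIN_LENGTH - HOP_LENGTH) // 2  # 784 samples per side
    padded = num_samples + 2 * pad_amount
    return 1 + (padded - N_FFT) // HOP_LENGTH


def feature_frames(num_samples_48k: int, feature_hop: int = 320) -> int:
    """Idealized frame count after repeating 16 kHz features twice.

    feature_hop is an assumed hop size for the feature extractor.
    """
    num_samples_16k = num_samples_48k * SR_16K // SR_48K
    return 2 * (num_samples_16k // feature_hop)
```

For a 3-second slice (144000 samples at 48 kHz) both functions give 300 frames. A real extractor can come out a frame or two short of this idealized count, which is presumably why `fix_length` crops all features to a common length. At this rate the 900-frame cap in `feature_postprocess` corresponds to about 9 seconds.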

zerorvc/f0/__init__.py ADDED Viewed

	@@ -0,0 +1,3 @@

+from .extractor import F0Extractor
+from .rmvpe import RMVPE
+from .load import load_rmvpe

zerorvc/f0/extractor.py ADDED Viewed

	@@ -0,0 +1,65 @@

+import logging
+import numpy as np
+import librosa
+from .rmvpe import RMVPE
+from ..constants import SR_16K
+logger = logging.getLogger(__name__)
+class F0Extractor:
+    def __init__(
+        self,
+        rmvpe: RMVPE = None,
+        sr=SR_16K,
+        f0_bin=256,
+        f0_max=1100.0,
+        f0_min=50.0,
+    ):
+        self.sr = sr
+        self.f0_bin = f0_bin
+        self.f0_max = f0_max
+        self.f0_min = f0_min
+        self.f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+        self.f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+        if rmvpe is not None:
+            self.load(rmvpe)
+    def load(self, rmvpe: RMVPE):
+        self.rmvpe = rmvpe
+        self.device = next(rmvpe.parameters()).device
+        logger.info(f"RMVPE model is on {self.device}")
+    def is_loaded(self) -> bool:
+        return hasattr(self, "rmvpe")
+    def calculate_f0_from_f0nsf(self, f0nsf: np.ndarray):
+        f0_mel = 1127 * np.log(1 + f0nsf / 700)
+        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * (
+            self.f0_bin - 2
+        ) / (self.f0_mel_max - self.f0_mel_min) + 1
+        # clamp bins to [1, f0_bin - 1]; zero-valued frames map to bin 1
+        f0_mel[f0_mel <= 1] = 1
+        f0_mel[f0_mel > self.f0_bin - 1] = self.f0_bin - 1
+        f0 = np.rint(f0_mel).astype(int)
+        assert f0.max() <= 255 and f0.min() >= 1, (
+            f0.max(),
+            f0.min(),
+        )
+        return f0
+    def extract_f0_from(self, y: np.ndarray, modification=0.0):
+        f0nsf = self.rmvpe.infer_from_audio(y, thred=0.03)
+        f0nsf *= pow(2, modification / 12)
+        f0 = self.calculate_f0_from_f0nsf(f0nsf)
+        return f0nsf, f0
+    def extract_f0(self, wav_file: str):
+        y, _ = librosa.load(wav_file, sr=self.sr)
+        return self.extract_f0_from(y)
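
Note on `F0Extractor`: `calculate_f0_from_f0nsf` converts a frequency in Hz to the mel scale, rescales it linearly so that `f0_min` lands on bin 1 and `f0_max` on bin `f0_bin - 1`, then clamps and rounds. `extract_f0_from` applies the pitch modification as a factor of `2 ** (modification / 12)`. The scalar sketch below restates both steps with the default parameters from the constructor. The function names are illustrative, and it uses the standard library instead of NumPy.

```python
import math

F0_BIN = 256
F0_MIN = 50.0
F0_MAX = 1100.0


def hz_to_mel(f0_hz: float) -> float:
    """Mel-scale conversion used by the extractor."""
    return 1127 * math.log(1 + f0_hz / 700)


def quantize_f0(f0_hz: float) -> int:
    """Map a frequency in Hz to a coarse bin in [1, F0_BIN - 1]."""
    mel = hz_to_mel(f0_hz)
    if mel > 0:
        mel_min, mel_max = hz_to_mel(F0_MIN), hz_to_mel(F0_MAX)
        mel = (mel - mel_min) * (F0_BIN - 2) / (mel_max - mel_min) + 1
    # Clamp to the valid bin range; zero-valued frames end up in bin 1.
    mel = min(max(mel, 1), F0_BIN - 1)
    return round(mel)


def shift_semitones(f0_hz: float, semitones: float) -> float:
    """Scale a frequency by a number of equal-tempered semitones."""
    return f0_hz * 2 ** (semitones / 12)
```

A shift of 12 semitones doubles the frequency and -12 halves it, which matches the range of `pitch_modification` values tried in `example-infer.py`.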

zerorvc/f0/load.py ADDED Viewed

	@@ -0,0 +1,27 @@

+import torch
+from .rmvpe import RMVPE
+def load_rmvpe(
+    rmvpe: str | RMVPE | None = None, device: torch.device = torch.device("cpu")
+) -> RMVPE:
+    """
+    Load the RMVPE model from a file or download it if necessary.
+    If a loaded model is provided, it will be returned as is.
+    Args:
+        rmvpe (str | RMVPE | None): The path to the RMVPE model file or the pre-loaded RMVPE model. If None, the default model will be downloaded.
+        device (torch.device): The device to load the model on.
+    Returns:
+        RMVPE: The loaded RMVPE model.
+    Raises:
+        If the model file does not exist.
+    """
+    if isinstance(rmvpe, RMVPE):
+        return rmvpe.to(device)
+    if isinstance(rmvpe, str):
+        model = RMVPE.from_pretrained(rmvpe).to(device)
+        return model
+    return RMVPE.from_pretrained("safe-models/RMVPE").to(device)

zerorvc/f0/rmvpe/__init__.py ADDED Viewed

	@@ -0,0 +1,6 @@

+# The RMVPE model is from https://github.com/Dream-High/RMVPE
+# Apache License 2.0: https://github.com/Dream-High/RMVPE/blob/main/LICENSE
+# With modifications from https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/infer/lib/rmvpe.py
+# MIT License: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/LICENSE
+from .model import RMVPE

zerorvc/f0/rmvpe/constants.py ADDED Viewed

	@@ -0,0 +1,8 @@

+N_CLASS = 360
+N_MELS = 128
+MAGIC_CONST = 1997.3794084376191
+SAMPLE_RATE = 16000
+WINDOW_LENGTH = 1024
+HOP_LENGTH = 160
+MEL_FMIN = 30
+MEL_FMAX = SAMPLE_RATE // 2

zerorvc/f0/rmvpe/deepunet.py ADDED Viewed

	@@ -0,0 +1,227 @@

+from typing import List, Tuple
+import torch
+from torch import nn
+from .constants import *
+class ConvBlockRes(nn.Module):
+    def __init__(self, in_channels: int, out_channels: int, momentum=0.01):
+        super().__init__()
+        self.conv = nn.Sequential(
+            nn.Conv2d(
+                in_channels=in_channels,
+                out_channels=out_channels,
+                kernel_size=(3, 3),
+                stride=(1, 1),
+                padding=(1, 1),
+                bias=False,
+            ),
+            nn.BatchNorm2d(out_channels, momentum=momentum),
+            nn.ReLU(),
+            nn.Conv2d(
+                in_channels=out_channels,
+                out_channels=out_channels,
+                kernel_size=(3, 3),
+                stride=(1, 1),
+                padding=(1, 1),
+                bias=False,
+            ),
+            nn.BatchNorm2d(out_channels, momentum=momentum),
+            nn.ReLU(),
+        )
+        # self.shortcut:Optional[nn.Module] = None
+        if in_channels != out_channels:
+            self.shortcut = nn.Conv2d(in_channels, out_channels, (1, 1))
+    def forward(self, x: torch.Tensor):
+        if not hasattr(self, "shortcut"):
+            return self.conv(x) + x
+        else:
+            return self.conv(x) + self.shortcut(x)
+class Encoder(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        in_size: int,
+        n_encoders: int,
+        kernel_size: int,
+        n_blocks: int,
+        out_channels=16,
+        momentum=0.01,
+    ):
+        super().__init__()
+        self.n_encoders = n_encoders
+        self.bn = nn.BatchNorm2d(in_channels, momentum=momentum)
+        self.layers = nn.ModuleList()
+        self.latent_channels = []
+        for i in range(self.n_encoders):
+            self.layers.append(
+                ResEncoderBlock(
+                    in_channels, out_channels, kernel_size, n_blocks, momentum=momentum
+                )
+            )
+            self.latent_channels.append([out_channels, in_size])
+            in_channels = out_channels
+            out_channels *= 2
+            in_size //= 2
+        self.out_size = in_size
+        self.out_channel = out_channels
+    def forward(self, x: torch.Tensor):
+        concat_tensors: List[torch.Tensor] = []
+        x = self.bn(x)
+        for i, layer in enumerate(self.layers):
+            t, x = layer(x)
+            concat_tensors.append(t)
+        return x, concat_tensors
+class ResEncoderBlock(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        kernel_size: int | None = None,
+        n_blocks=1,
+        momentum=0.01,
+    ):
+        super().__init__()
+        self.n_blocks = n_blocks
+        self.conv = nn.ModuleList()
+        self.conv.append(ConvBlockRes(in_channels, out_channels, momentum))
+        for _ in range(n_blocks - 1):
+            self.conv.append(ConvBlockRes(out_channels, out_channels, momentum))
+        self.kernel_size = kernel_size
+        if kernel_size is not None:
+            self.pool = nn.AvgPool2d(kernel_size=kernel_size)
+    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        for conv in self.conv:
+            x = conv(x)
+        if self.kernel_size is None:
+            return x, x
+        return x, self.pool(x)
+class Intermediate(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        n_inters: int,
+        n_blocks: int,
+        momentum=0.01,
+    ):
+        super().__init__()
+        self.n_inters = n_inters
+        self.layers = nn.ModuleList()
+        self.layers.append(
+            ResEncoderBlock(in_channels, out_channels, None, n_blocks, momentum)
+        )
+        for _ in range(self.n_inters - 1):
+            self.layers.append(
+                ResEncoderBlock(out_channels, out_channels, None, n_blocks, momentum)
+            )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        for layer in self.layers:
+            x, _ = layer(x)
+        return x
+class ResDecoderBlock(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        stride: int,
+        n_blocks=1,
+        momentum=0.01,
+    ):
+        super().__init__()
+        out_padding = (0, 1) if stride == (1, 2) else (1, 1)
+        self.n_blocks = n_blocks
+        self.conv1 = nn.Sequential(
+            nn.ConvTranspose2d(
+                in_channels=in_channels,
+                out_channels=out_channels,
+                kernel_size=(3, 3),
+                stride=stride,
+                padding=(1, 1),
+                output_padding=out_padding,
+                bias=False,
+            ),
+            nn.BatchNorm2d(out_channels, momentum=momentum),
+            nn.ReLU(),
+        )
+        self.conv2 = nn.ModuleList()
+        self.conv2.append(ConvBlockRes(out_channels * 2, out_channels, momentum))
+        for _ in range(n_blocks - 1):
+            self.conv2.append(ConvBlockRes(out_channels, out_channels, momentum))
+    def forward(self, x: torch.Tensor, concat_tensor: torch.Tensor) -> torch.Tensor:
+        x = self.conv1(x)
+        x = torch.cat((x, concat_tensor), dim=1)
+        for conv2 in self.conv2:
+            x = conv2(x)
+        return x
+class Decoder(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        n_decoders: int,
+        stride: int,
+        n_blocks: int,
+        momentum=0.01,
+    ):
+        super().__init__()
+        self.layers = nn.ModuleList()
+        self.n_decoders = n_decoders
+        for _ in range(self.n_decoders):
+            out_channels = in_channels // 2
+            self.layers.append(
+                ResDecoderBlock(in_channels, out_channels, stride, n_blocks, momentum)
+            )
+            in_channels = out_channels
+    def forward(
+        self, x: torch.Tensor, concat_tensors: List[torch.Tensor]
+    ) -> torch.Tensor:
+        for i, layer in enumerate(self.layers):
+            x = layer(x, concat_tensors[-1 - i])
+        return x
+class DeepUnet(nn.Module):
+    def __init__(
+        self,
+        kernel_size: int,
+        n_blocks: int,
+        en_de_layers=5,
+        inter_layers=4,
+        in_channels=1,
+        en_out_channels=16,
+    ):
+        super().__init__()
+        self.encoder = Encoder(
+            in_channels, N_MELS, en_de_layers, kernel_size, n_blocks, en_out_channels
+        )
+        self.intermediate = Intermediate(
+            self.encoder.out_channel // 2,
+            self.encoder.out_channel,
+            inter_layers,
+            n_blocks,
+        )
+        self.decoder = Decoder(
+            self.encoder.out_channel, en_de_layers, kernel_size, n_blocks
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x, concat_tensors = self.encoder(x)
+        x = self.intermediate(x)
+        x = self.decoder(x, concat_tensors)
+        return x
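
Note on channel bookkeeping in `DeepUnet`: the `Encoder` constructor doubles `out_channels` and halves `in_size` after every block, so `encoder.out_channel` ends up at twice the channel count of the last encoder block. That is why `Intermediate` receives `out_channel // 2` as its input width. The sketch below replays only that arithmetic with the defaults shown above (`N_MELS = 128`, five layers, 16 initial channels). The helper names are illustrative, and no tensors are involved.

```python
def encoder_plan(in_size=128, n_encoders=5, out_channels=16):
    """Mirror the Encoder constructor's channel and size bookkeeping."""
    latent = []
    for _ in range(n_encoders):
        latent.append((out_channels, in_size))
        out_channels *= 2
        in_size //= 2
    return latent, in_size, out_channels


def decoder_plan(in_channels, n_decoders=5):
    """Mirror the Decoder constructor: each block halves the channel count."""
    plan = []
    for _ in range(n_decoders):
        out_channels = in_channels // 2
        plan.append((in_channels, out_channels))
        in_channels = out_channels
    return plan
```

With the defaults, the encoder blocks emit 16, 32, 64, 128, and 256 channels, the intermediate stage maps 256 to 512, and the decoder walks back from 512 to 16, matching the `en_out_channels` input expected by the final `nn.Conv2d` in `RMVPE`.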

zerorvc/f0/rmvpe/mel.py ADDED Viewed

	@@ -0,0 +1,68 @@

+import os
+import torch
+import torch.nn as nn
+import numpy as np
+import librosa
+from .stft import STFT, TorchSTFT
+USING_TORCH_STFT = os.getenv("USING_TORCH_STFT") is not None
+class MelSpectrogram(nn.Module):
+    def __init__(
+        self,
+        n_mel_channels: int,
+        sampling_rate: int,
+        win_length: int,
+        hop_length: int,
+        n_fft: int = None,
+        mel_fmin: int = 0,
+        mel_fmax: int = None,
+        clamp: float = 1e-5,
+    ):
+        super().__init__()
+        n_fft = win_length if n_fft is None else n_fft
+        mel_basis = librosa.filters.mel(
+            sr=sampling_rate,
+            n_fft=n_fft,
+            n_mels=n_mel_channels,
+            fmin=mel_fmin,
+            fmax=mel_fmax,
+            htk=True,
+        )
+        mel_basis = torch.from_numpy(mel_basis).float()
+        self.register_buffer("mel_basis", mel_basis, persistent=False)
+        self.n_fft = n_fft
+        self.hop_length = hop_length
+        self.win_length = win_length
+        self.sampling_rate = sampling_rate
+        self.n_mel_channels = n_mel_channels
+        self.clamp = clamp
+        self.keyshift = 0
+        self.speed = 1
+        self.factor = 2 ** (self.keyshift / 12)
+        self.n_fft_new = int(np.round(self.n_fft * self.factor))
+        self.win_length_new = int(np.round(self.win_length * self.factor))
+        self.hop_length_new = int(np.round(self.hop_length * self.speed))
+        if USING_TORCH_STFT:
+            self.stft = TorchSTFT(
+                filter_length=self.n_fft_new,
+                hop_length=self.hop_length_new,
+                win_length=self.win_length_new,
+                window="hann",
+            )
+        else:
+            self.stft = STFT(
+                filter_length=self.n_fft_new,
+                hop_length=self.hop_length_new,
+                win_length=self.win_length_new,
+                window="hann",
+            )
+    def forward(self, audio: torch.Tensor):
+        magnitude = self.stft(audio)
+        mel_output = torch.matmul(self.mel_basis, magnitude)
+        log_mel_spec = torch.log(torch.clamp(mel_output, min=self.clamp))
+        return log_mel_spec

zerorvc/f0/rmvpe/model.py ADDED Viewed

	@@ -0,0 +1,118 @@

+import logging
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+from huggingface_hub import PyTorchModelHubMixin
+from .seq import BiGRU
+from .deepunet import DeepUnet
+from .mel import MelSpectrogram
+from .constants import *
+logger = logging.getLogger(__name__)
+class RMVPE(nn.Module, PyTorchModelHubMixin):
+    def __init__(
+        self,
+        n_blocks: int,
+        n_gru: int,
+        kernel_size: int,
+        en_de_layers=5,
+        inter_layers=4,
+        in_channels=1,
+        en_out_channels=16,
+    ):
+        super().__init__()
+        self.mel_extractor = MelSpectrogram(
+            N_MELS, SAMPLE_RATE, WINDOW_LENGTH, HOP_LENGTH, None, MEL_FMIN, MEL_FMAX
+        )
+        self.unet = DeepUnet(
+            kernel_size,
+            n_blocks,
+            en_de_layers,
+            inter_layers,
+            in_channels,
+            en_out_channels,
+        )
+        self.cnn = nn.Conv2d(en_out_channels, 3, (3, 3), padding=(1, 1))
+        if n_gru:
+            self.fc = nn.Sequential(
+                BiGRU(3 * N_MELS, 256, n_gru),
+                nn.Linear(512, N_CLASS),
+                nn.Dropout(0.25),
+                nn.Sigmoid(),
+            )
+        else:
+            self.fc = nn.Sequential(
+                nn.Linear(3 * N_MELS, N_CLASS), nn.Dropout(0.25), nn.Sigmoid()
+            )
+        cents_mapping = 20 * np.arange(360) + MAGIC_CONST
+        self.cents_mapping = np.pad(cents_mapping, (4, 4))  # 368
+        self.cents_mapping_torch = torch.from_numpy(self.cents_mapping).to(
+            dtype=torch.float32
+        )
+    def to(self, device):
+        self.cents_mapping_torch = self.cents_mapping_torch.to(device)
+        return super().to(device)
+    def forward(self, mel: torch.Tensor) -> torch.Tensor:
+        mel = mel.transpose(-1, -2).unsqueeze(1)
+        x = self.cnn(self.unet(mel)).transpose(1, 2).flatten(-2)
+        x = self.fc(x)
+        return x
+    def mel2hidden(self, mel: torch.Tensor):
+        with torch.no_grad():
+            n_frames = mel.shape[2]
+            n_pad = 32 * ((n_frames - 1) // 32 + 1) - n_frames
+            mel = F.pad(mel, (0, n_pad), mode="constant")
+            hidden = self(mel)
+            return hidden[:, :n_frames]
+    def decode(self, hidden: torch.Tensor, thred=0.03):
+        cents_pred = self.to_local_average_cents(hidden, thred=thred)
+        f0 = 10 * (2 ** (cents_pred / 1200))
+        f0[f0 == 10] = 0
+        return f0
+    def infer(self, audio: torch.Tensor, thred=0.03, return_tensor=False):
+        mel = self.mel_extractor(audio.unsqueeze(0))
+        hidden = self.mel2hidden(mel)
+        hidden = hidden[0].float()
+        f0 = self.decode(hidden, thred=thred)
+        if return_tensor:
+            return f0
+        return f0.cpu().numpy()
+    def infer_from_audio(self, audio: np.ndarray, thred=0.03):
+        audio = torch.from_numpy(audio).to(next(self.parameters()).device)
+        return self.infer(audio, thred=thred)
+    def to_local_average_cents(
+        self, salience: torch.Tensor, thred=0.05
+    ) -> torch.Tensor:
+        center = torch.argmax(salience, dim=1)
+        salience = F.pad(salience, (4, 4))
+        center += 4
+        batch_indices = torch.arange(salience.shape[0], device=salience.device)
+        # Create indices for the 9-point window around each center
+        offsets = torch.arange(-4, 5, device=salience.device)
+        indices = center.unsqueeze(1) + offsets.unsqueeze(0)
+        # Extract values using advanced indexing
+        todo_salience = salience[batch_indices.unsqueeze(1), indices]
+        todo_cents_mapping = self.cents_mapping_torch[indices]
+        product_sum = torch.sum(todo_salience * todo_cents_mapping, 1)
+        weight_sum = torch.sum(todo_salience, 1)
+        divided = product_sum / weight_sum
+        maxx = torch.max(salience, 1).values
+        divided[maxx <= thred] = 0
+        return divided
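
Review note: `decode` and `to_local_average_cents` above turn a per-frame salience vector over 360 pitch bins into a frequency in Hz. A minimal pure-Python sketch of the same idea for a single frame follows. It is an illustration under assumptions, not the model's code: `CENTS_OFFSET` stands in for `MAGIC_CONST`, whose value is defined in constants.py and not shown here, and instead of zero-padding by 4 bins the sketch clips the window at the edges, which gives the same weighted mean because padded bins carry zero salience.

```python
CENTS_OFFSET = 1997.3794084376191  # assumption: stands in for MAGIC_CONST
CENTS_PER_BIN = 20                 # bin spacing used by cents_mapping
N_BINS = 360


def local_average_cents(salience, threshold=0.03, radius=4):
    """Salience-weighted mean of the cent values in a window around the peak bin."""
    center = max(range(len(salience)), key=lambda i: salience[i])
    if salience[center] <= threshold:
        return 0.0  # peak too weak: frame is treated as unvoiced
    lo = max(0, center - radius)
    hi = min(len(salience), center + radius + 1)
    weighted = sum(salience[i] * (CENTS_PER_BIN * i + CENTS_OFFSET) for i in range(lo, hi))
    total = sum(salience[i] for i in range(lo, hi))
    return weighted / total


def cents_to_hz(cents):
    """Convert cents relative to 10 Hz into Hz; 0 cents marks an unvoiced frame."""
    return 0.0 if cents == 0 else 10.0 * 2.0 ** (cents / 1200.0)


# A symmetric peak around bin 100 averages back to the cent value of bin 100.
salience = [0.0] * N_BINS
salience[99], salience[100], salience[101] = 0.5, 1.0, 0.5
f0_hz = cents_to_hz(local_average_cents(salience))
```

Averaging over the 9-bin window lets the estimate fall between the 20-cent bin centers instead of snapping to the nearest bin.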

zerorvc/f0/rmvpe/seq.py ADDED Viewed

	@@ -0,0 +1,18 @@

+import torch
+import torch.nn as nn
+class BiGRU(nn.Module):
+    def __init__(self, input_features: int, hidden_features: int, num_layers: int):
+        super().__init__()
+        self.gru = nn.GRU(
+            input_features,
+            hidden_features,
+            num_layers=num_layers,
+            batch_first=True,
+            bidirectional=True,
+        )
+        self.gru.flatten_parameters()
+    def forward(self, x: torch.Tensor):
+        return self.gru(x)[0]

zerorvc/f0/rmvpe/stft.py ADDED Viewed

	@@ -0,0 +1,119 @@

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+from librosa.util import pad_center
+from scipy.signal import get_window
+class TorchSTFT(nn.Module):
+    def __init__(
+        self, filter_length=1024, hop_length=512, win_length=None, window="hann"
+    ):
+        """
+        This module implements an STFT using PyTorch's stft function.
+        Keyword Arguments:
+            filter_length {int} -- Length of filters used (default: {1024})
+            hop_length {int} -- Hop length of STFT (default: {512})
+            win_length {int} -- Length of the window function applied to each frame (if not specified, it
+                equals the filter length). (default: {None})
+            window {str} -- Accepted for interface compatibility with STFT but ignored; a Hann window
+                (torch.hann_window) is always used. (default: {'hann'})
+        """
+        super(TorchSTFT, self).__init__()
+        self.n_fft_new = filter_length
+        self.hop_length_new = hop_length
+        self.win_length_new = win_length if win_length else filter_length
+        self.center = True
+        hann_window_0 = torch.hann_window(self.win_length_new)
+        self.register_buffer("hann_window_0", hann_window_0, persistent=False)
+    def forward(self, input_data):
+        fft = torch.stft(
+            input_data,
+            n_fft=self.n_fft_new,
+            hop_length=self.hop_length_new,
+            win_length=self.win_length_new,
+            window=self.hann_window_0,
+            center=self.center,
+            return_complex=True,
+        )
+        magnitude = torch.sqrt(fft.real.pow(2) + fft.imag.pow(2))
+        return magnitude
+class STFT(nn.Module):
+    def __init__(
+        self, filter_length=1024, hop_length=512, win_length=None, window="hann"
+    ):
+        """
+        This module implements an STFT using 1D convolution and 1D transpose convolutions.
+        This is a bit tricky so there are some cases that probably won't work as working
+        out the same sizes before and after in all overlap add setups is tough. Right now,
+        this code should work with hop lengths that are half the filter length (50% overlap
+        between frames).
+        Keyword Arguments:
+            filter_length {int} -- Length of filters used (default: {1024})
+            hop_length {int} -- Hop length of STFT (restrict to 50% overlap between frames) (default: {512})
+            win_length {int} -- Length of the window function applied to each frame (if not specified, it
+                equals the filter length). (default: {None})
+            window {str} -- Type of window to use (options are bartlett, hann, hamming, blackman, blackmanharris)
+                (default: {'hann'})
+        """
+        super(STFT, self).__init__()
+        self.filter_length = filter_length
+        self.hop_length = hop_length
+        self.win_length = win_length if win_length else filter_length
+        self.window = window
+        self.forward_transform = None
+        self.pad_amount = int(self.filter_length / 2)
+        fourier_basis = np.fft.fft(np.eye(self.filter_length))
+        cutoff = int((self.filter_length / 2 + 1))
+        fourier_basis = np.vstack(
+            [np.real(fourier_basis[:cutoff, :]), np.imag(fourier_basis[:cutoff, :])]
+        )
+        forward_basis = torch.FloatTensor(fourier_basis)
+        inverse_basis = torch.FloatTensor(np.linalg.pinv(fourier_basis))
+        assert filter_length >= self.win_length
+        # get window and zero center pad it to filter_length
+        fft_window = get_window(window, self.win_length, fftbins=True)
+        fft_window = pad_center(fft_window, size=filter_length)
+        fft_window = torch.from_numpy(fft_window).float()
+        # window the bases
+        forward_basis *= fft_window
+        inverse_basis = (inverse_basis.T * fft_window).T
+        self.register_buffer("forward_basis", forward_basis.float(), persistent=False)
+        self.register_buffer("inverse_basis", inverse_basis.float(), persistent=False)
+        self.register_buffer("fft_window", fft_window.float(), persistent=False)
+    def forward(self, input_data):
+        """Take input data (audio) to STFT domain using convolution."""
+        input_data = F.pad(
+            input_data,
+            (self.pad_amount, self.pad_amount),
+            mode="reflect",
+        )
+        # Reshape input for convolution
+        input_data = input_data.unsqueeze(1)
+        # Create windowed basis as convolution weights
+        forward_transform = F.conv1d(
+            input_data,
+            self.forward_basis.unsqueeze(1),
+            stride=self.hop_length,
+            groups=1,
+        )
+        cutoff = int((self.filter_length / 2) + 1)
+        real_part = forward_transform[:, :cutoff, :]
+        imag_part = forward_transform[:, cutoff:, :]
+        magnitude = torch.sqrt(real_part**2 + imag_part**2)
+        return magnitude
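
Review note: the convolutional `STFT` above stacks the real and imaginary rows of the DFT matrix into one basis and recovers the magnitude from the two halves of the projection. A minimal pure-Python sketch of that step for a single frame follows. It is a simplification under stated assumptions: no window is applied (the class multiplies the basis by a zero-padded window), there is no reflect padding or striding, and the helper names are hypothetical.

```python
import math


def fourier_basis(filter_length):
    """Rows 0..cutoff-1 are the real DFT rows, the next cutoff rows the imaginary ones."""
    cutoff = filter_length // 2 + 1
    real = [
        [math.cos(2 * math.pi * k * n / filter_length) for n in range(filter_length)]
        for k in range(cutoff)
    ]
    imag = [
        [-math.sin(2 * math.pi * k * n / filter_length) for n in range(filter_length)]
        for k in range(cutoff)
    ]
    return real + imag


def frame_magnitude(frame, basis):
    """Project one frame onto the stacked basis and combine real/imag halves."""
    cutoff = len(basis) // 2
    projection = [sum(b * x for b, x in zip(row, frame)) for row in basis]
    return [math.hypot(projection[k], projection[k + cutoff]) for k in range(cutoff)]


# A cosine that completes exactly 3 cycles in 16 samples lands in bin 3.
N = 16
frame = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
magnitudes = frame_magnitude(frame, fourier_basis(N))
```

In the class, the same projection is applied to every frame at once by using the basis rows as `conv1d` filters with `stride=hop_length`.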

zerorvc/hubert/__init__.py ADDED Viewed

	@@ -0,0 +1,2 @@


+from .extractor import HubertFeatureExtractor, HubertModel
+from .load import load_hubert

zerorvc/hubert/extractor.py ADDED Viewed

	@@ -0,0 +1,40 @@

+import logging
+import librosa
+import numpy as np
+from transformers import AutoProcessor, HubertModel
+from ..constants import SR_16K
+logger = logging.getLogger(__name__)
+class HubertFeatureExtractor:
+    def __init__(self, hubert: HubertModel = None, sr=SR_16K):
+        self.sr = sr
+        if hubert is not None:
+            self.load(hubert)
+    def load(self, hubert: HubertModel):
+        self.hubert = hubert
+        self.device = next(hubert.parameters()).device
+        self.processor = AutoProcessor.from_pretrained("safe-models/ContentVec")
+        logger.info(f"HuBERT model is on {self.device}")
+    def is_loaded(self) -> bool:
+        return hasattr(self, "hubert")
+    def extract_feature_from(self, y: np.ndarray) -> np.ndarray:
+        input_values = self.processor(
+            y, sampling_rate=self.sr, return_tensors="pt"
+        ).input_values
+        input_values = input_values.to(self.device)
+        feats = self.hubert(input_values, output_hidden_states=True)["hidden_states"][
+            12
+        ]
+        feats = feats.squeeze(0).float().cpu().detach().numpy()
+        if np.isnan(feats).sum() > 0:
+            feats = np.nan_to_num(feats)
+        return feats
+    def extract_feature(self, wav_file: str) -> np.ndarray:
+        y, _ = librosa.load(wav_file, sr=self.sr)
+        return self.extract_feature_from(y)

zerorvc/hubert/load.py ADDED Viewed

	@@ -0,0 +1,28 @@

+import torch
+from transformers import HubertModel
+def load_hubert(
+    hubert: str | HubertModel | None = None,
+    device: torch.device = torch.device("cpu"),
+) -> HubertModel:
+    """
+    Load the Hubert model from a local path or Hugging Face Hub repository ID, downloading it if necessary.
+    If a loaded model is provided, it is moved to the given device and returned.
+    Args:
+        hubert (str | HubertModel | None): The path or repository ID of the Hubert model, or a pre-loaded Hubert model. If None, the default model ("safe-models/ContentVec") will be downloaded.
+        device (torch.device): The device to load the model on.
+    Returns:
+        HubertModel: The loaded Hubert model.
+    Raises:
+        OSError: If the model cannot be found or loaded from the given path or repository ID.
+    """
+    if isinstance(hubert, HubertModel):
+        return hubert.to(device)
+    if isinstance(hubert, str):
+        model = HubertModel.from_pretrained(hubert).to(device)
+        return model
+    return HubertModel.from_pretrained("safe-models/ContentVec").to(device)

zerorvc/preprocess/__init__.py ADDED Viewed

	@@ -0,0 +1,2 @@


+from .preprocess import Preprocessor
+from .crop import crop_feats_length

zerorvc/preprocess/crop.py ADDED Viewed

	@@ -0,0 +1,16 @@

+from typing import Tuple
+import numpy as np
+def crop_feats_length(
+    spec: np.ndarray, phone: np.ndarray, pitch: np.ndarray, pitchf: np.ndarray
+) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
+    phone_len = phone.shape[0]
+    spec_len = spec.shape[1]
+    if phone_len != spec_len:
+        len_min = min(phone_len, spec_len)
+        phone = phone[:len_min, :]
+        pitch = pitch[:len_min]
+        pitchf = pitchf[:len_min]
+        spec = spec[:, :len_min]
+    return spec, phone, pitch, pitchf

zerorvc/preprocess/preprocess.py ADDED Viewed

	@@ -0,0 +1,54 @@

+import numpy as np
+import librosa
+from scipy import signal
+from .slicer2 import Slicer
+class Preprocessor:
+    def __init__(
+        self, sr: int, max_slice_length: float = 3.0, min_slice_length: float = 0.5
+    ):
+        self.slicer = Slicer(
+            sr=sr,
+            threshold=-42,
+            min_length=1500,
+            min_interval=400,
+            hop_size=15,
+            max_sil_kept=500,
+        )
+        self.sr = sr
+        self.bh, self.ah = signal.butter(N=5, Wn=48, btype="high", fs=self.sr)
+        self.max_slice_length = max_slice_length
+        self.min_slice_length = min_slice_length
+        self.overlap = 0.3
+        self.tail = self.max_slice_length + self.overlap
+        self.max = 0.9
+        self.alpha = 0.75
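+    def norm(self, samples: np.ndarray) -> np.ndarray:
+        sample_max = np.abs(samples).max()
+        if sample_max == 0:
+            # All-zero (silent) input: nothing to scale, and dividing would yield NaNs.
+            return samples
+        normalized = samples / sample_max * self.max
+        normalized = (normalized * self.alpha) + (samples * (1 - self.alpha))
+        return normalized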
+    def preprocess_audio(self, y: np.ndarray) -> list[np.ndarray]:
+        y = signal.filtfilt(self.bh, self.ah, y)
+        audios = []
+        for audio in self.slicer.slice(y):
+            i = 0
+            while True:
+                start = int(self.sr * (self.max_slice_length - self.overlap) * i)
+                i += 1
+                if len(audio[start:]) > self.tail * self.sr:
+                    slice = audio[start : start + int(self.max_slice_length * self.sr)]
+                    audios.append(self.norm(slice))
+                else:
+                    slice = audio[start:]
+                    if len(slice) > self.min_slice_length * self.sr:
+                        audios.append(self.norm(slice))
+                    break
+        return audios
+    def preprocess_file(self, file_path: str) -> list[np.ndarray]:
+        y, _ = librosa.load(file_path, sr=self.sr)
+        return self.preprocess_audio(y)
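
Review note: the `while` loop in `preprocess_audio` cuts each sliced clip into overlapping windows and keeps the remainder only if it is long enough. A minimal pure-Python sketch of the boundary arithmetic follows, using the defaults from `Preprocessor.__init__`. The helper name `slice_bounds` is hypothetical, and the sample rate of 1000 in the usage line is chosen only to keep the numbers readable.

```python
def slice_bounds(n_samples, sr, max_len=3.0, min_len=0.5, overlap=0.3):
    """Return (start, end) sample indices for each window cut from one clip."""
    tail = max_len + overlap
    bounds = []
    i = 0
    while True:
        # Consecutive windows start (max_len - overlap) seconds apart.
        start = int(sr * (max_len - overlap) * i)
        i += 1
        remaining = n_samples - start
        if remaining > tail * sr:
            # Enough audio left for a full window plus a usable tail.
            bounds.append((start, start + int(max_len * sr)))
        else:
            # Last piece: keep it only if it exceeds the minimum length.
            if remaining > min_len * sr:
                bounds.append((start, n_samples))
            break
    return bounds


ten_seconds = slice_bounds(n_samples=10_000, sr=1000)
```

Because the loop only emits a full window while more than `max_len + overlap` seconds remain, the final piece can be up to 3.3 seconds long with the defaults, slightly longer than `max_slice_length`.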

zerorvc/preprocess/slicer2.py ADDED Viewed

	@@ -0,0 +1,147 @@

+# From https://github.com/openvpi/audio-slicer
+# MIT License: https://github.com/openvpi/audio-slicer/blob/main/LICENSE
+from librosa.feature import rms as get_rms
+class Slicer:
+    def __init__(
+        self,
+        sr: int,
+        threshold: float = -40.0,
+        min_length: int = 5000,
+        min_interval: int = 300,
+        hop_size: int = 20,
+        max_sil_kept: int = 5000,
+    ):
+        if not min_length >= min_interval >= hop_size:
+            raise ValueError(
+                "The following condition must be satisfied: min_length >= min_interval >= hop_size"
+            )
+        if not max_sil_kept >= hop_size:
+            raise ValueError(
+                "The following condition must be satisfied: max_sil_kept >= hop_size"
+            )
+        min_interval = sr * min_interval / 1000
+        self.threshold = 10 ** (threshold / 20.0)
+        self.hop_size = round(sr * hop_size / 1000)
+        self.win_size = min(round(min_interval), 4 * self.hop_size)
+        self.min_length = round(sr * min_length / 1000 / self.hop_size)
+        self.min_interval = round(min_interval / self.hop_size)
+        self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)
+    def _apply_slice(self, waveform, begin, end):
+        if len(waveform.shape) > 1:
+            return waveform[
+                :, begin * self.hop_size : min(waveform.shape[1], end * self.hop_size)
+            ]
+        else:
+            return waveform[
+                begin * self.hop_size : min(waveform.shape[0], end * self.hop_size)
+            ]
+    # @timeit
+    def slice(self, waveform):
+        if len(waveform.shape) > 1:
+            samples = waveform.mean(axis=0)
+        else:
+            samples = waveform
+        if samples.shape[0] <= self.min_length:
+            return [waveform]
+        rms_list = get_rms(
+            y=samples, frame_length=self.win_size, hop_length=self.hop_size
+        ).squeeze(0)
+        sil_tags = []
+        silence_start = None
+        clip_start = 0
+        for i, rms in enumerate(rms_list):
+            # Keep looping while frame is silent.
+            if rms < self.threshold:
+                # Record start of silent frames.
+                if silence_start is None:
+                    silence_start = i
+                continue
+            # Keep looping while frame is not silent and silence start has not been recorded.
+            if silence_start is None:
+                continue
+            # Clear recorded silence start if interval is not enough or clip is too short
+            is_leading_silence = silence_start == 0 and i > self.max_sil_kept
+            need_slice_middle = (
+                i - silence_start >= self.min_interval
+                and i - clip_start >= self.min_length
+            )
+            if not is_leading_silence and not need_slice_middle:
+                silence_start = None
+                continue
+            # Need slicing. Record the range of silent frames to be removed.
+            if i - silence_start <= self.max_sil_kept:
+                pos = rms_list[silence_start : i + 1].argmin() + silence_start
+                if silence_start == 0:
+                    sil_tags.append((0, pos))
+                else:
+                    sil_tags.append((pos, pos))
+                clip_start = pos
+            elif i - silence_start <= self.max_sil_kept * 2:
+                pos = rms_list[
+                    i - self.max_sil_kept : silence_start + self.max_sil_kept + 1
+                ].argmin()
+                pos += i - self.max_sil_kept
+                pos_l = (
+                    rms_list[
+                        silence_start : silence_start + self.max_sil_kept + 1
+                    ].argmin()
+                    + silence_start
+                )
+                pos_r = (
+                    rms_list[i - self.max_sil_kept : i + 1].argmin()
+                    + i
+                    - self.max_sil_kept
+                )
+                if silence_start == 0:
+                    sil_tags.append((0, pos_r))
+                    clip_start = pos_r
+                else:
+                    sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
+                    clip_start = max(pos_r, pos)
+            else:
+                pos_l = (
+                    rms_list[
+                        silence_start : silence_start + self.max_sil_kept + 1
+                    ].argmin()
+                    + silence_start
+                )
+                pos_r = (
+                    rms_list[i - self.max_sil_kept : i + 1].argmin()
+                    + i
+                    - self.max_sil_kept
+                )
+                if silence_start == 0:
+                    sil_tags.append((0, pos_r))
+                else:
+                    sil_tags.append((pos_l, pos_r))
+                clip_start = pos_r
+            silence_start = None
+        # Deal with trailing silence.
+        total_frames = rms_list.shape[0]
+        if (
+            silence_start is not None
+            and total_frames - silence_start >= self.min_interval
+        ):
+            silence_end = min(total_frames, silence_start + self.max_sil_kept)
+            pos = rms_list[silence_start : silence_end + 1].argmin() + silence_start
+            sil_tags.append((pos, total_frames + 1))
+        # Apply and return slices.
+        if len(sil_tags) == 0:
+            return [waveform]
+        else:
+            chunks = []
+            if sil_tags[0][0] > 0:
+                chunks.append(self._apply_slice(waveform, 0, sil_tags[0][0]))
+            for i in range(len(sil_tags) - 1):
+                chunks.append(
+                    self._apply_slice(waveform, sil_tags[i][1], sil_tags[i + 1][0])
+                )
+            if sil_tags[-1][1] < total_frames:
+                chunks.append(
+                    self._apply_slice(waveform, sil_tags[-1][1], total_frames)
+                )
+            return chunks
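
Review note: `Slicer.__init__` converts its arguments from decibels and milliseconds into a linear amplitude threshold and frame counts. A minimal pure-Python sketch of those conversions follows, evaluated with the values that `Preprocessor` passes in. The helper name `slicer_params` is hypothetical, and the 48 kHz sample rate is an assumption for illustration, since `Preprocessor` receives `sr` from its caller.

```python
def slicer_params(sr, threshold_db, min_length_ms, min_interval_ms, hop_size_ms, max_sil_kept_ms):
    """Mirror the unit conversions in Slicer.__init__ (hypothetical helper)."""
    min_interval_samples = sr * min_interval_ms / 1000
    hop_size = round(sr * hop_size_ms / 1000)  # hop in samples
    return {
        "threshold": 10 ** (threshold_db / 20.0),  # dB to linear RMS amplitude
        "hop_size": hop_size,
        "win_size": min(round(min_interval_samples), 4 * hop_size),  # RMS window in samples
        "min_length": round(sr * min_length_ms / 1000 / hop_size),   # in frames
        "min_interval": round(min_interval_samples / hop_size),      # in frames
        "max_sil_kept": round(sr * max_sil_kept_ms / 1000 / hop_size),  # in frames
    }


# Values used by Preprocessor; sr=48000 is assumed here.
params = slicer_params(48000, -42, 1500, 400, 15, 500)
```

After conversion, every length except `hop_size` and `win_size` is expressed in RMS frames, which is why the loop in `slice` compares frame indices directly against `min_interval`, `min_length`, and `max_sil_kept`.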

zerorvc/pretrained.py ADDED Viewed

	@@ -0,0 +1,14 @@

+from typing import Tuple
+from huggingface_hub import hf_hub_download
+def pretrained_checkpoints() -> Tuple[str, str]:
+    """
+    The pretrained checkpoints from the Hugging Face Hub.
+    Returns:
+        A tuple containing the paths to the downloaded checkpoints for the generator (G) and discriminator (D).
+    """
+    G = hf_hub_download("lj1995/VoiceConversionWebUI", "pretrained_v2/f0G48k.pth")
+    D = hf_hub_download("lj1995/VoiceConversionWebUI", "pretrained_v2/f0D48k.pth")
+    return G, D

zerorvc/rvc.py ADDED Viewed

	@@ -0,0 +1,366 @@

+from logging import getLogger
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import librosa
+from accelerate import Accelerator
+from datasets import Dataset
+from .f0 import F0Extractor, RMVPE, load_rmvpe
+from .hubert import HubertFeatureExtractor, HubertModel, load_hubert
+from .synthesizer import SynthesizerTrnMs768NSFsid
+from .constants import *
+logger = getLogger(__name__)
+class Synthesizer(SynthesizerTrnMs768NSFsid):
+    def forward(self, phone, pitch, pitchf, sid):
+        if type(phone.shape[1]) == int:
+            phone_lengths = torch.tensor(
+                [phone.shape[1]], device=phone.device, dtype=torch.int32
+            )
+        else:
+            phone_lengths = phone.shape[1]
+        g = self.emb_g(sid).unsqueeze(-1)
+        m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
+        z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
+        z = self.flow(z_p, x_mask, g=g, reverse=True)
+        o = self.dec(z * x_mask, pitchf, g=g, n_res=None)
+        return o
+class FeatureExtractor(nn.Module):
+    def __init__(self, hubert: HubertModel, rmvpe: RMVPE):
+        super().__init__()
+        self.hubert = hubert
+        self.rmvpe = rmvpe
+    def to(self, device):
+        self.hubert = self.hubert.to(device)
+        self.rmvpe = self.rmvpe.to(device)
+        return super().to(device)
+    def forward(self, audio16k, pitch_modification):
+        phone = self.hubert(audio16k, output_hidden_states=True)["hidden_states"][12]
+        phone = phone.squeeze(0).float()
+        phone_lengths = phone.shape[0]
+        if type(phone_lengths) == int:
+            phone_lengths = torch.tensor(
+                [phone_lengths], device=phone.device, dtype=torch.int32
+            )
+        pitchf = self.rmvpe.infer(audio16k.squeeze(0), thred=0.03, return_tensor=True)
+        pitchf *= torch.pow(
+            2,
+            torch.tensor(
+                pitch_modification / 12.0, dtype=torch.float32, device=pitchf.device
+            ),
+        )
+        pitch = self.calculate_f0_from_f0nsf_torch(pitchf)
+        pitch = pitch.unsqueeze(0)
+        pitchf = pitchf.unsqueeze(0)
+        phone = phone.unsqueeze(0)
+        logger.info(
+            f"{phone.shape=}, {phone_lengths=}, {pitch.shape=}, {pitchf.shape=}"
+        )
+        feats0 = phone.clone()
+        feats: torch.Tensor = F.interpolate(
+            phone.permute(0, 2, 1), scale_factor=2
+        ).permute(0, 2, 1)
+        feats0: torch.Tensor = F.interpolate(
+            feats0.permute(0, 2, 1), scale_factor=2
+        ).permute(0, 2, 1)
+        phone_len = feats.shape[1]
+        pitch = pitch[:, :phone_len]
+        pitchf = pitchf[:, :phone_len]
+        pitchff = pitchf.clone()
+        pitchff[pitchf > 0] = 1
+        pitchff[pitchf < 1] = 0.33
+        pitchff = pitchff.unsqueeze(-1)
+        feats = feats * pitchff + feats0 * (1 - pitchff)
+        feats = feats.to(feats0.dtype)
+        if type(phone_len) == int:
+            phone_len = torch.tensor(
+                [phone_len], device=feats.device, dtype=torch.int32
+            )
+        else:
+            phone_len = phone_len.unsqueeze(0)
+        logger.info(f"{feats.shape=}, {pitch.shape=}, {pitchf.shape=}, {phone_len=}")
+        return feats, phone_len, pitch, pitchf
+    def calculate_f0_from_f0nsf_torch(self, f0nsf: torch.Tensor):
+        f0_mel = 1127 * torch.log(1 + f0nsf / 700)
+        f0_max = torch.tensor(1100.0)
+        f0_min = torch.tensor(50.0)
+        f0_bin = torch.tensor(256)
+        f0_mel_max = 1127 * torch.log(1 + f0_max / 700)
+        f0_mel_min = 1127 * torch.log(1 + f0_min / 700)
+        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * (f0_bin - 2) / (
+            f0_mel_max - f0_mel_min
+        ) + 1
+        # use 0 or 1
+        f0_mel[f0_mel <= 1] = 1
+        f0_mel[f0_mel > f0_bin - 1] = f0_bin - 1
+        f0 = torch.round(f0_mel).long()
+        f0 = torch.clamp(f0, 1, 255)
+        return f0
+class RVC:
+    """
+    RVC (Retrieval-based Voice Conversion) class for converting speech using a pre-trained model.
+    Args:
+        synthesizer (str | Synthesizer): The name of the pre-trained model or the model instance itself.
+        hubert (HubertModel | None, optional): A loaded Hubert model. If None, the default model is loaded with load_hubert(). Defaults to None.
+        rmvpe (RMVPE | None, optional): A loaded RMVPE model. If None, the default model is loaded with load_rmvpe(). Defaults to None.
+        sr (int, optional): The sample rate of the input audio. Defaults to SR_48K.
+        segment_size (float, optional): The segment size for splitting the input audio. Defaults to 30.0 seconds.
+        accelerator (Accelerator | None, optional): The accelerator used for model inference. If None, a new Accelerator() is created. Defaults to None.
+        from_pretrained_kwargs (dict, optional): Additional keyword arguments for loading the pre-trained model. Defaults to {}.
+    Methods:
+        from_pretrained(name, hubert=None, rmvpe=None, sr=SR_48K, segment_size=30.0, accelerator=None, **from_pretrained_kwargs):
+            Creates an instance of RVC using the from_pretrained method.
+        convert(audio, pitch_modification=0.0):
+            Converts the input audio to the target voice using the pre-trained model.
+        convert_dataset(dataset, pitch_modification=0.0):
+            Converts a dataset of audio samples to the target voice using the pre-trained model.
+        convert_file(audio, pitch_modification=0.0):
+            Converts a single audio file to the target voice using the pre-trained model.
+        convert_from_wav16k(wav16k, pitch_modification=0.0):
+            Converts a 16kHz waveform to the target voice using the pre-trained model.
+        convert_from_features(phone, pitchf, pitch, pitch_modification=0.0):
+            Converts audio features (phone, pitchf, pitch) to the target voice using the pre-trained model.
+    """
+    def __init__(
+        self,
+        synthesizer: str | Synthesizer,
+        hubert: HubertModel | None = None,
+        rmvpe: RMVPE | None = None,
+        sr=SR_48K,
+        segment_size=30.0,
+        accelerator: Accelerator | None = None,
+        from_pretrained_kwargs={},
+    ):
+        """
+        Initializes an instance of the RVC class.
+        Args:
+            synthesizer (str | Synthesizer): The name of the pre-trained model or the model instance itself.
+            hubert (HubertModel | None, optional): A loaded Hubert model. If None, the default model is loaded with load_hubert(). Defaults to None.
+            rmvpe (RMVPE | None, optional): A loaded RMVPE model. If None, the default model is loaded with load_rmvpe(). Defaults to None.
+            sr (int, optional): The sample rate of the input audio. Defaults to SR_48K.
+            segment_size (float, optional): The segment size for splitting the input audio. Defaults to 30.0 seconds.
+            accelerator (Accelerator | None, optional): The accelerator used for model inference. If None, a new Accelerator() is created. Defaults to None.
+            from_pretrained_kwargs (dict, optional): Additional keyword arguments for loading the pre-trained model. Defaults to {}.
+        """
+        accelerator = accelerator or Accelerator()
+        self.accelerator = accelerator
+        self.synthesizer = (
+            Synthesizer.from_pretrained(synthesizer, **from_pretrained_kwargs)
+            if isinstance(synthesizer, str)
+            else synthesizer
+        )
+        self.synthesizer = self.synthesizer.to(accelerator.device)
+        hubert = hubert or load_hubert()
+        rmvpe = rmvpe or load_rmvpe()
+        self.feature_extractor = FeatureExtractor(hubert, rmvpe)
+        self.feature_extractor = self.feature_extractor.to(accelerator.device)
+        self.sr = sr
+        self.segment_size = segment_size
+    @staticmethod
+    def from_pretrained(
+        name: str,
+        hubert: HubertModel | None = None,
+        rmvpe: RMVPE | None = None,
+        sr=SR_48K,
+        segment_size=30.0,
+        accelerator: Accelerator | None = None,
+        **from_pretrained_kwargs,
+    ):
+        """
+        Creates an instance of RVC using the from_pretrained method.
+        Args:
+            name (str): The name of the pre-trained model.
+            hubert (HubertModel | None, optional): A loaded Hubert model. If None, the default model is loaded with load_hubert(). Defaults to None.
+            rmvpe (RMVPE | None, optional): A loaded RMVPE model. If None, the default model is loaded with load_rmvpe(). Defaults to None.
+            sr (int, optional): The sample rate of the input audio. Defaults to SR_48K.
+            segment_size (float, optional): The segment size for splitting the input audio. Defaults to 30.0 seconds.
+            accelerator (Accelerator | None, optional): The accelerator used for model inference. If None, a new Accelerator() is created. Defaults to None.
+            from_pretrained_kwargs (dict): Additional keyword arguments for loading the pre-trained model.
+        Returns:
+            RVC: An instance of the RVC class.
+        """
+        return RVC(
+            name,
+            hubert=hubert,
+            rmvpe=rmvpe,
+            sr=sr,
+            segment_size=segment_size,
+            accelerator=accelerator,
+            from_pretrained_kwargs=from_pretrained_kwargs,
+        )
+    def convert(self, audio: str | Dataset | np.ndarray, pitch_modification=0.0):
+        """
+        Converts the input audio to the target voice using the pre-trained model.
+        Args:
+            audio (str | Dataset | np.ndarray): The input audio to be converted. It can be a file path, a dataset of audio samples, or a 16kHz waveform as a numpy array.
+            pitch_modification (float, optional): The pitch shift in semitones. Defaults to 0.0.
+        Returns:
+            np.ndarray: The converted audio in the target voice.
+            If the input is a dataset, it yields the converted audio samples one by one.
+        """
+        logger.info(f"audio: {audio}, pitch_modification: {pitch_modification}")
+        if isinstance(audio, str):
+            return self.convert_file(audio, pitch_modification=pitch_modification)
+        if isinstance(audio, Dataset):
+            return self.convert_dataset(audio, pitch_modification=pitch_modification)
+        return self.convert_from_wav16k(audio, pitch_modification=pitch_modification)
+    def convert_dataset(self, dataset: Dataset, pitch_modification=0.0):
+        """
+        Converts a dataset of audio samples to the target voice using the pre-trained model.
+        Args:
+            dataset (Dataset): The dataset of audio samples to be converted.
+            pitch_modification (float, optional): The pitch modification factor. Defaults to 0.0.
+        Yields:
+            np.ndarray: The converted audio samples in the target voice.
+        """
+        for i, data in enumerate(dataset):
+            logger.info(f"Converting data {i}")
+            phone = data["hubert_feats"]
+            pitchf = data["f0nsf"]
+            pitch = data["f0"]
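+            # The dataset rows carry precomputed pitch features, and
+            # convert_from_features() takes no pitch_modification argument,
+            # so passing it here would raise a TypeError.
+            yield self.convert_from_features(
+                phone=phone,
+                pitchf=pitchf,
+                pitch=pitch,
+            )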
+    def convert_file(self, audio: str, pitch_modification=0.0) -> np.ndarray:
+        """
+        Converts a single audio file to the target voice using the pre-trained model.
+        Args:
+            audio (str): The path to the audio file to be converted.
+            pitch_modification (float, optional): The pitch modification factor. Defaults to 0.0.
+        Returns:
+            np.ndarray: The converted audio in the target voice.
+        """
+        wav16k, _ = librosa.load(audio, sr=SR_16K)
+        logger.info(f"Loaded {audio} with shape {wav16k.shape}")
+        return self.convert_from_wav16k(wav16k, pitch_modification=pitch_modification)
+    @torch.no_grad()
+    def convert_from_wav16k(
+        self, wav16k: np.ndarray, pitch_modification=0.0
+    ) -> np.ndarray:
+        """
+        Converts a 16kHz waveform to the target voice using the pre-trained model.
+        Args:
+            wav16k (np.ndarray): The 16kHz waveform to be converted.
+            pitch_modification (float, optional): The pitch modification factor. Defaults to 0.0.
+        Returns:
+            np.ndarray: The converted audio in the target voice.
+        """
+        self.feature_extractor.eval()
+        feature_extractor_device = next(self.feature_extractor.parameters()).device
+        ret = []
+        segment_size = int(self.segment_size * SR_16K)
+        for i in range(0, len(wav16k), segment_size):
+            segment = wav16k[i : i + segment_size]
+            segment = np.pad(segment, (SR_16K, SR_16K), mode="reflect")
+            logger.info(f"Padded audio with shape {segment.shape}")
+            phone, phone_lengths, pitch, pitchf = self.feature_extractor(
+                torch.from_numpy(segment)
+                .unsqueeze(0)
+                .to(device=feature_extractor_device),
+                pitch_modification,
+            )
+            logger.info(
+                f"{phone.shape=}, {phone_lengths=}, {pitch.shape=}, {pitchf.shape=}"
+            )
+            ret.append(
+                self.convert_from_features(phone, pitchf, pitch)[self.sr : -self.sr]
+            )
+        return np.concatenate(ret)
+    @torch.no_grad()
+    def convert_from_features(
+        self,
+        phone: np.ndarray | torch.Tensor,
+        pitchf: np.ndarray | torch.Tensor,
+        pitch: np.ndarray | torch.Tensor,
+    ) -> np.ndarray:
+        """
+        Converts audio features (phone, pitchf, pitch) to the target voice using the pre-trained model.
+        Args:
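+            phone (np.ndarray | torch.Tensor): The phone features of the audio (the `hubert_feats` column when converting a dataset).
+            pitchf (np.ndarray | torch.Tensor): The pitch features of the audio (the `f0nsf` column when converting a dataset).
+            pitch (np.ndarray | torch.Tensor): The pitch values of the audio (the `f0` column when converting a dataset).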
+        Returns:
+            np.ndarray: The converted audio in the target voice.
+        """
+        self.synthesizer.eval()
+        synthesizer_device = next(self.synthesizer.parameters()).device
+        if isinstance(phone, np.ndarray):
+            phone = torch.from_numpy(phone).to(device=synthesizer_device)
+        if isinstance(pitchf, np.ndarray):
+            pitchf = torch.from_numpy(pitchf).to(device=synthesizer_device)
+        if isinstance(pitch, np.ndarray):
+            pitch = torch.from_numpy(pitch).to(device=synthesizer_device)
+        if phone.dim() == 2:
+            phone = phone.unsqueeze(0)
+        if pitchf.dim() == 1:
+            pitchf = pitchf.unsqueeze(0)
+        if pitch.dim() == 1:
+            pitch = pitch.unsqueeze(0)
+        sid = torch.tensor([0], device=synthesizer_device, dtype=torch.int32)
+        audio_segment = (
+            self.synthesizer(phone, pitch, pitchf, sid).squeeze().cpu().float().numpy()
+        )
+        logger.info(
+            f"Generated audio shape: {audio_segment.shape} {audio_segment.dtype}"
+        )
+        return audio_segment
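
A rough sketch of the segment arithmetic in `convert_from_wav16k` above, written in plain Python so the sample counts can be checked by hand. The helper name `plan_segments` and its `sr_out` parameter are hypothetical. The trimmed length assumes the synthesizer emits exactly `sr_out / SR_16K` output samples per input sample, which the code above does not guarantee.

```python
SR_16K = 16_000  # input sample rate used by the feature extractor above


def plan_segments(n_samples, segment_seconds=30.0, sr_out=48_000):
    """Return (start, end, padded_len, trimmed_out_len) for each segment.

    Assumption: the synthesizer upsamples by exactly sr_out / SR_16K.
    """
    step = int(segment_seconds * SR_16K)
    plan = []
    for start in range(0, n_samples, step):
        end = min(start + step, n_samples)
        # One second of reflect padding is added on each side of the segment.
        padded_len = (end - start) + 2 * SR_16K
        # One second (sr_out samples) is trimmed from each side of the output.
        out_len = padded_len * sr_out // SR_16K - 2 * sr_out
        plan.append((start, end, padded_len, out_len))
    return plan


plan = plan_segments(70 * SR_16K)  # 70 seconds of 16 kHz audio
total_out = sum(out_len for *_, out_len in plan)
```

Under these assumptions, 70 seconds of input splits into three segments, and the trimmed outputs add up to 70 seconds at the output rate, so the padding only supplies context at the segment edges.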

zerorvc/synthesizer/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+from .models import SynthesizerTrnMs768NSFsid, MultiPeriodDiscriminator

zerorvc/synthesizer/attentions.py ADDED Viewed

	@@ -0,0 +1,493 @@

+import math
+from typing import Optional
+import torch
+from torch import nn
+from torch.nn import functional as F
+from . import commons
+from .modules import LayerNorm
+class Encoder(nn.Module):
+    def __init__(
+        self,
+        hidden_channels: int,
+        filter_channels: int,
+        n_heads: int,
+        n_layers: int,
+        kernel_size=1,
+        p_dropout=0.0,
+        window_size=10,
+    ):
+        super().__init__()
+        self.hidden_channels = hidden_channels
+        self.filter_channels = filter_channels
+        self.n_heads = n_heads
+        self.n_layers = int(n_layers)
+        self.kernel_size = kernel_size
+        self.p_dropout = p_dropout
+        self.window_size = window_size
+        self.drop = nn.Dropout(p_dropout)
+        self.attn_layers = nn.ModuleList()
+        self.norm_layers_1 = nn.ModuleList()
+        self.ffn_layers = nn.ModuleList()
+        self.norm_layers_2 = nn.ModuleList()
+        for i in range(self.n_layers):
+            self.attn_layers.append(
+                MultiHeadAttention(
+                    hidden_channels,
+                    hidden_channels,
+                    n_heads,
+                    p_dropout=p_dropout,
+                    window_size=window_size,
+                )
+            )
+            self.norm_layers_1.append(LayerNorm(hidden_channels))
+            self.ffn_layers.append(
+                FFN(
+                    hidden_channels,
+                    hidden_channels,
+                    filter_channels,
+                    kernel_size,
+                    p_dropout=p_dropout,
+                )
+            )
+            self.norm_layers_2.append(LayerNorm(hidden_channels))
+    def forward(self, x, x_mask):
+        attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+        x = x * x_mask
+        zipped = zip(
+            self.attn_layers, self.norm_layers_1, self.ffn_layers, self.norm_layers_2
+        )
+        for attn_layers, norm_layers_1, ffn_layers, norm_layers_2 in zipped:
+            y = attn_layers(x, x, attn_mask)
+            y = self.drop(y)
+            x = norm_layers_1(x + y)
+            y = ffn_layers(x, x_mask)
+            y = self.drop(y)
+            x = norm_layers_2(x + y)
+        x = x * x_mask
+        return x
+class Decoder(nn.Module):
+    def __init__(
+        self,
+        hidden_channels: int,
+        filter_channels: int,
+        n_heads: int,
+        n_layers: int,
+        kernel_size=1,
+        p_dropout=0.0,
+        proximal_bias=False,
+        proximal_init=True,
+    ):
+        super().__init__()
+        self.hidden_channels = hidden_channels
+        self.filter_channels = filter_channels
+        self.n_heads = n_heads
+        self.n_layers = n_layers
+        self.kernel_size = kernel_size
+        self.p_dropout = p_dropout
+        self.proximal_bias = proximal_bias
+        self.proximal_init = proximal_init
+        self.drop = nn.Dropout(p_dropout)
+        self.self_attn_layers = nn.ModuleList()
+        self.norm_layers_0 = nn.ModuleList()
+        self.encdec_attn_layers = nn.ModuleList()
+        self.norm_layers_1 = nn.ModuleList()
+        self.ffn_layers = nn.ModuleList()
+        self.norm_layers_2 = nn.ModuleList()
+        for i in range(self.n_layers):
+            self.self_attn_layers.append(
+                MultiHeadAttention(
+                    hidden_channels,
+                    hidden_channels,
+                    n_heads,
+                    p_dropout=p_dropout,
+                    proximal_bias=proximal_bias,
+                    proximal_init=proximal_init,
+                )
+            )
+            self.norm_layers_0.append(LayerNorm(hidden_channels))
+            self.encdec_attn_layers.append(
+                MultiHeadAttention(
+                    hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout
+                )
+            )
+            self.norm_layers_1.append(LayerNorm(hidden_channels))
+            self.ffn_layers.append(
+                FFN(
+                    hidden_channels,
+                    hidden_channels,
+                    filter_channels,
+                    kernel_size,
+                    p_dropout=p_dropout,
+                    causal=True,
+                )
+            )
+            self.norm_layers_2.append(LayerNorm(hidden_channels))
+    def forward(
+        self,
+        x: torch.Tensor,
+        x_mask: torch.Tensor,
+        h: torch.Tensor,
+        h_mask: torch.Tensor,
+    ):
+        """
+        x: decoder input
+        h: encoder output
+        """
+        self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to(
+            device=x.device, dtype=x.dtype
+        )
+        encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+        x = x * x_mask
+        for i in range(self.n_layers):
+            y = self.self_attn_layers[i](x, x, self_attn_mask)
+            y = self.drop(y)
+            x = self.norm_layers_0[i](x + y)
+            y = self.encdec_attn_layers[i](x, h, encdec_attn_mask)
+            y = self.drop(y)
+            x = self.norm_layers_1[i](x + y)
+            y = self.ffn_layers[i](x, x_mask)
+            y = self.drop(y)
+            x = self.norm_layers_2[i](x + y)
+        x = x * x_mask
+        return x
+class MultiHeadAttention(nn.Module):
+    def __init__(
+        self,
+        channels: int,
+        out_channels: int,
+        n_heads: int,
+        p_dropout=0.0,
+        window_size: int = None,
+        heads_share=True,
+        block_length: int = None,
+        proximal_bias=False,
+        proximal_init=False,
+    ):
+        super().__init__()
+        assert channels % n_heads == 0
+        self.channels = channels
+        self.out_channels = out_channels
+        self.n_heads = n_heads
+        self.p_dropout = p_dropout
+        self.window_size = window_size
+        self.heads_share = heads_share
+        self.block_length = block_length
+        self.proximal_bias = proximal_bias
+        self.proximal_init = proximal_init
+        self.attn = None
+        self.k_channels = channels // n_heads
+        self.conv_q = nn.Conv1d(channels, channels, 1)
+        self.conv_k = nn.Conv1d(channels, channels, 1)
+        self.conv_v = nn.Conv1d(channels, channels, 1)
+        self.conv_o = nn.Conv1d(channels, out_channels, 1)
+        self.drop = nn.Dropout(p_dropout)
+        if window_size is not None:
+            n_heads_rel = 1 if heads_share else n_heads
+            rel_stddev = self.k_channels**-0.5
+            self.emb_rel_k = nn.Parameter(
+                torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels)
+                * rel_stddev
+            )
+            self.emb_rel_v = nn.Parameter(
+                torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels)
+                * rel_stddev
+            )
+        nn.init.xavier_uniform_(self.conv_q.weight)
+        nn.init.xavier_uniform_(self.conv_k.weight)
+        nn.init.xavier_uniform_(self.conv_v.weight)
+        if proximal_init:
+            with torch.no_grad():
+                self.conv_k.weight.copy_(self.conv_q.weight)
+                self.conv_k.bias.copy_(self.conv_q.bias)
+    def forward(
+        self, x: torch.Tensor, c: torch.Tensor, attn_mask: Optional[torch.Tensor] = None
+    ):
+        q = self.conv_q(x)
+        k = self.conv_k(c)
+        v = self.conv_v(c)
+        x, _ = self.attention(q, k, v, mask=attn_mask)
+        x = self.conv_o(x)
+        return x
+    def attention(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        mask: Optional[torch.Tensor] = None,
+    ):
+        # reshape [b, d, t] -> [b, n_h, t, d_k]
+        b, d, t_s = key.shape
+        if isinstance(t_s, int):
+            t_s = torch.tensor(t_s, device=key.device, dtype=torch.int32)
+        t_t = query.size(2)
+        query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
+        key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+        value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+        scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1))
+        if self.window_size is not None:
+            assert (
+                t_s == t_t
+            ), "Relative attention is only available for self-attention."
+            key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
+            rel_logits = self._matmul_with_relative_keys(
+                query / math.sqrt(self.k_channels), key_relative_embeddings
+            )
+            scores_local = self._relative_position_to_absolute_position(rel_logits)
+            scores = scores + scores_local
+        if self.proximal_bias:
+            assert t_s == t_t, "Proximal bias is only available for self-attention."
+            scores = scores + self._attention_bias_proximal(t_s).to(
+                device=scores.device, dtype=scores.dtype
+            )
+        if mask is not None:
+            scores = scores.masked_fill(mask == 0, -1e4)
+            if self.block_length is not None:
+                assert (
+                    t_s == t_t
+                ), "Local attention is only available for self-attention."
+                block_mask = (
+                    torch.ones_like(scores)
+                    .triu(-self.block_length)
+                    .tril(self.block_length)
+                )
+                scores = scores.masked_fill(block_mask == 0, -1e4)
+        p_attn = F.softmax(scores, dim=-1)  # [b, n_h, t_t, t_s]
+        p_attn = self.drop(p_attn)
+        output = torch.matmul(p_attn, value)
+        if self.window_size is not None:
+            relative_weights = self._absolute_position_to_relative_position(p_attn)
+            value_relative_embeddings = self._get_relative_embeddings(
+                self.emb_rel_v, t_s
+            )
+            output = output + self._matmul_with_relative_values(
+                relative_weights, value_relative_embeddings
+            )
+        output = (
+            output.transpose(2, 3).contiguous().view(b, d, t_t)
+        )  # [b, n_h, t_t, d_k] -> [b, d, t_t]
+        return output, p_attn
+    def _matmul_with_relative_values(self, x: torch.Tensor, y: torch.Tensor):
+        """
+        x: [b, h, l, m]
+        y: [h or 1, m, d]
+        ret: [b, h, l, d]
+        """
+        ret = torch.matmul(x, y.unsqueeze(0))
+        return ret
+    def _matmul_with_relative_keys(self, x: torch.Tensor, y: torch.Tensor):
+        """
+        x: [b, h, l, d]
+        y: [h or 1, m, d]
+        ret: [b, h, l, m]
+        """
+        ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
+        return ret
+    def _get_relative_embeddings(
+        self, relative_embeddings: torch.Tensor, length: torch.Tensor
+    ):
+        """
+        Get relative embeddings based on the input length.
+        Args:
+            relative_embeddings: Predefined relative embeddings [n_heads_rel, 2*window_size+1, d].
+            length: The length of the sequence (converted to a tensor if it is not one already).
+        Returns:
+            Used relative embeddings [n_heads_rel, 2*length-1, d].
+        """
+        # Ensure `length` is a tensor
+        if not isinstance(length, torch.Tensor):
+            length = torch.as_tensor(
+                length, device=relative_embeddings.device, dtype=torch.int32
+            )
+        # Calculate padding dynamically using PyTorch operations
+        pad_length = torch.maximum(
+            length - (self.window_size + 1),
+            torch.zeros(1, device=length.device, dtype=length.dtype),
+        )
+        slice_start_position = torch.maximum(
+            (self.window_size + 1) - length,
+            torch.zeros(1, device=length.device, dtype=length.dtype),
+        )
+        slice_end_position = slice_start_position + 2 * length - 1
+        padded_relative_embeddings = F.pad(
+            relative_embeddings,
+            [
+                0,
+                0,
+                pad_length,
+                pad_length,
+                0,
+                0,
+            ],
+        )
+        used_relative_embeddings = padded_relative_embeddings[
+            :, slice_start_position:slice_end_position
+        ]
+        return used_relative_embeddings
+    def _relative_position_to_absolute_position(self, x: torch.Tensor):
+        """
+        x: [b, h, l, 2*l-1]
+        ret: [b, h, l, l]
+        """
+        batch, heads, length, _ = x.size()
+        # Concat columns of pad to shift from relative to absolute indexing.
+        x = F.pad(
+            x,
+            #   commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]])
+            [0, 1, 0, 0, 0, 0, 0, 0],
+        )
+        # Append extra elements so the total adds up to shape (len+1, 2*len-1).
+        x_flat = x.view([batch, heads, length * 2 * length])
+        x_flat = F.pad(
+            x_flat,
+            # commons.convert_pad_shape([[0, 0], [0, 0], [0, int(length) - 1]])
+            [0, length - 1, 0, 0, 0, 0],
+        )
+        # Reshape and slice out the padded elements.
+        x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[
+            :, :, :length, length - 1 :
+        ]
+        return x_final
+    def _absolute_position_to_relative_position(self, x: torch.Tensor):
+        """
+        x: [b, h, l, l]
+        ret: [b, h, l, 2*l-1]
+        """
+        batch, heads, length, _ = x.size()
+        # Pad along the column dimension.
+        x = F.pad(
+            x,
+            # commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, int(length) - 1]])
+            [0, length - 1, 0, 0, 0, 0, 0, 0],
+        )
+        x_flat = x.view([batch, heads, length**2 + length * (length - 1)])
+        # add 0's in the beginning that will skew the elements after reshape
+        x_flat = F.pad(
+            x_flat,
+            #    commons.convert_pad_shape([[0, 0], [0, 0], [int(length), 0]])
+            [length, 0, 0, 0, 0, 0],
+        )
+        x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
+        return x_final
+    def _attention_bias_proximal(self, length: int):
+        """Bias for self-attention to encourage attention to close positions.
+        Args:
+          length: an integer scalar.
+        Returns:
+          a Tensor with shape [1, 1, length, length]
+        """
+        r = torch.arange(length, dtype=torch.float32)
+        diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
+        return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
+class FFN(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        filter_channels: int,
+        kernel_size: int,
+        p_dropout=0.0,
+        activation: str = None,
+        causal=False,
+    ):
+        super().__init__()
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.filter_channels = filter_channels
+        self.kernel_size = kernel_size
+        self.p_dropout = p_dropout
+        self.activation = activation
+        self.causal = causal
+        self.is_activation = activation == "gelu"
+        # if causal:
+        #     self.padding = self._causal_padding
+        # else:
+        #     self.padding = self._same_padding
+        self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size)
+        self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size)
+        self.drop = nn.Dropout(p_dropout)
+    def padding(self, x: torch.Tensor, x_mask: torch.Tensor) -> torch.Tensor:
+        if self.causal:
+            padding = self._causal_padding(x * x_mask)
+        else:
+            padding = self._same_padding(x * x_mask)
+        return padding
+    def forward(self, x: torch.Tensor, x_mask: torch.Tensor):
+        x = self.conv_1(self.padding(x, x_mask))
+        if self.is_activation:
+            x = x * torch.sigmoid(1.702 * x)
+        else:
+            x = torch.relu(x)
+        x = self.drop(x)
+        x = self.conv_2(self.padding(x, x_mask))
+        return x * x_mask
+    def _causal_padding(self, x: torch.Tensor):
+        if self.kernel_size == 1:
+            return x
+        pad_l: int = self.kernel_size - 1
+        pad_r: int = 0
+        # padding = [[0, 0], [0, 0], [pad_l, pad_r]]
+        x = F.pad(
+            x,
+            #   commons.convert_pad_shape(padding)
+            [pad_l, pad_r, 0, 0, 0, 0],
+        )
+        return x
+    def _same_padding(self, x: torch.Tensor):
+        if self.kernel_size == 1:
+            return x
+        pad_l: int = (self.kernel_size - 1) // 2
+        pad_r: int = self.kernel_size // 2
+        # padding = [[0, 0], [0, 0], [pad_l, pad_r]]
+        x = F.pad(
+            x,
+            #   commons.convert_pad_shape(padding)
+            [pad_l, pad_r, 0, 0, 0, 0],
+        )
+        return x
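
The pad, flatten, pad, reshape sequence in `_relative_position_to_absolute_position` above can be hard to follow from the tensor code alone. Below is a sketch of the same index arithmetic on nested Python lists for a single head, without torch. The name `rel_to_abs` is hypothetical and not part of the module.

```python
def rel_to_abs(rel):
    """Map l rows of 2*l-1 relative logits to an l x l absolute matrix.

    rel[i][r] holds the logit for query i at relative offset r - (l - 1).
    """
    l = len(rel)
    width = 2 * l - 1
    # 1. Pad one zero column on the right and flatten: l * 2l values.
    flat = [v for row in rel for v in list(row) + [0.0]]
    # 2. Pad l - 1 zeros at the end: (l + 1) * (2l - 1) values in total.
    flat += [0.0] * (l - 1)
    # 3. Reshape to (l + 1, 2l - 1); each row is now skewed by one position.
    rows = [flat[r * width:(r + 1) * width] for r in range(l + 1)]
    # 4. Keep the first l rows and the last l columns.
    return [row[l - 1:] for row in rows[:l]]


l = 3
rel = [[10 * i + r for r in range(2 * l - 1)] for i in range(l)]
absolute = rel_to_abs(rel)
```

After the reshape, entry `absolute[i][j]` equals `rel[i][j - i + l - 1]`, so each query row reads its logits by key position instead of by offset.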

zerorvc/synthesizer/commons.py ADDED Viewed

	@@ -0,0 +1,172 @@

+from typing import List, Optional
+import math
+import torch
+from torch.nn import functional as F
+def init_weights(m, mean=0.0, std=0.01):
+    classname = m.__class__.__name__
+    if classname.find("Conv") != -1:
+        m.weight.data.normal_(mean, std)
+def get_padding(kernel_size: int, dilation=1):
+    return int((kernel_size * dilation - dilation) / 2)
+# def convert_pad_shape(pad_shape):
+#     l = pad_shape[::-1]
+#     pad_shape = [item for sublist in l for item in sublist]
+#     return pad_shape
+def kl_divergence(
+    m_p: torch.Tensor, logs_p: torch.Tensor, m_q: torch.Tensor, logs_q: torch.Tensor
+):
+    """KL(P||Q) between diagonal Gaussians given by means (m_*) and log standard deviations (logs_*)."""
+    kl = (logs_q - logs_p) - 0.5
+    kl += (
+        0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q)
+    )
+    return kl
+def rand_gumbel(shape):
+    """Sample from the Gumbel distribution, protect from overflows."""
+    uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
+    return -torch.log(-torch.log(uniform_samples))
+def rand_gumbel_like(x: torch.Tensor):
+    g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
+    return g
+def slice_segments(x: torch.Tensor, ids_str, segment_size=4):
+    ret = torch.zeros_like(x[:, :, :segment_size])
+    for i in range(x.size(0)):
+        idx_str = ids_str[i]
+        idx_end = idx_str + segment_size
+        ret[i] = x[i, :, idx_str:idx_end]
+    return ret
+def slice_segments2(x: torch.Tensor, ids_str, segment_size=4):
+    ret = torch.zeros_like(x[:, :segment_size])
+    for i in range(x.size(0)):
+        idx_str = ids_str[i]
+        idx_end = idx_str + segment_size
+        ret[i] = x[i, idx_str:idx_end]
+    return ret
+def rand_slice_segments(x: torch.Tensor, x_lengths=None, segment_size=4):
+    b, d, t = x.size()
+    if x_lengths is None:
+        x_lengths = t
+    ids_str_max = x_lengths - segment_size + 1
+    ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.int32)
+    ret = slice_segments(x, ids_str, segment_size)
+    return ret, ids_str
+def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4):
+    position = torch.arange(length, dtype=torch.float)
+    num_timescales = channels // 2
+    log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / (
+        num_timescales - 1
+    )
+    inv_timescales = min_timescale * torch.exp(
+        torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment
+    )
+    scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
+    signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
+    signal = F.pad(signal, [0, 0, 0, channels % 2])
+    signal = signal.view(1, channels, length)
+    return signal
+def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
+    b, channels, length = x.size()
+    signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
+    return x + signal.to(dtype=x.dtype, device=x.device)
+def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
+    b, channels, length = x.size()
+    signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
+    return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
+def subsequent_mask(length):
+    mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
+    return mask
+@torch.jit.script
+def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
+    n_channels_int = n_channels[0]
+    in_act = input_a + input_b
+    t_act = torch.tanh(in_act[:, :n_channels_int, :])
+    s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
+    acts = t_act * s_act
+    return acts
+# def convert_pad_shape(pad_shape):
+#     l = pad_shape[::-1]
+#     pad_shape = [item for sublist in l for item in sublist]
+#     return pad_shape
+def convert_pad_shape(pad_shape: List[List[int]]) -> List[int]:
+    return torch.tensor(pad_shape).flip(0).reshape(-1).int().tolist()
+def shift_1d(x):
+    x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
+    return x
+def sequence_mask(length: torch.Tensor, max_length: Optional[int] = None):
+    if max_length is None:
+        max_length = length.max()
+    x = torch.arange(max_length, dtype=length.dtype, device=length.device)
+    return x.unsqueeze(0) < length.unsqueeze(1)
+def generate_path(duration, mask):
+    """
+    duration: [b, 1, t_x]
+    mask: [b, 1, t_y, t_x]
+    """
+    device = duration.device
+    b, _, t_y, t_x = mask.shape
+    cum_duration = torch.cumsum(duration, -1)
+    cum_duration_flat = cum_duration.view(b * t_x)
+    path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
+    path = path.view(b, t_x, t_y)
+    path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
+    path = path.unsqueeze(1).transpose(2, 3) * mask
+    return path
+def clip_grad_value_(parameters, clip_value, norm_type=2):
+    if isinstance(parameters, torch.Tensor):
+        parameters = [parameters]
+    parameters = list(filter(lambda p: p.grad is not None, parameters))
+    norm_type = float(norm_type)
+    if clip_value is not None:
+        clip_value = float(clip_value)
+    total_norm = 0
+    for p in parameters:
+        param_norm = p.grad.data.norm(norm_type)
+        total_norm += param_norm.item() ** norm_type
+        if clip_value is not None:
+            p.grad.data.clamp_(min=-clip_value, max=clip_value)
+    total_norm = total_norm ** (1.0 / norm_type)
+    return total_norm
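
`F.pad` takes its padding as a flat list ordered from the last dimension to the first. That is why attentions.py uses hard-coded lists such as `[pad_l, pad_r, 0, 0, 0, 0]` in place of `convert_pad_shape([[0, 0], [0, 0], [pad_l, pad_r]])`. A plain-Python sketch of the helper, mirroring the commented-out version above, shows the reordering; the name `convert_pad_shape_py` is hypothetical.

```python
from typing import List


def convert_pad_shape_py(pad_shape: List[List[int]]) -> List[int]:
    """Turn per-dimension [left, right] pairs into F.pad's flat order."""
    # Reverse the pairs so the last dimension comes first, then flatten.
    return [item for pair in pad_shape[::-1] for item in pair]


# Pad one step on the left of the time axis of a [b, d, t] tensor.
time_left = convert_pad_shape_py([[0, 0], [0, 0], [1, 0]])
# Pad one step on the left of the middle axis instead.
middle_left = convert_pad_shape_py([[0, 0], [1, 0], [0, 0]])
```

Here `time_left` is `[1, 0, 0, 0, 0, 0]` and `middle_left` is `[0, 0, 1, 0, 0, 0]`, matching the list layout used by the `F.pad` calls in this commit.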