Yeserumo committed
Commit e57f790 · 1 parent: c653355
1.wav ADDED
Binary file (703 kB).
 
LICENSE.md ADDED
@@ -0,0 +1,24 @@
+MIT License
+
+Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
+Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
+Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
+Original work Copyright (c) 2015 braindead (https://github.com/braindead)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README copy.md ADDED
@@ -0,0 +1,56 @@
+# Real-Time Voice Cloning
+This repository is an implementation of [Transfer Learning from Speaker Verification to
+Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) (SV2TTS) with a vocoder that works in real-time. This was my [master's thesis](https://matheo.uliege.be/handle/2268.2/6801).
+
+SV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as a reference to generate speech given arbitrary text.
+
+**Video demonstration** (click the picture):
+
+[![Toolbox demo](https://i.imgur.com/8lFUlgz.png)](https://www.youtube.com/watch?v=-O_hYhToKoA)
+
+
+
+### Papers implemented
+| URL | Designation | Title | Implementation source |
+| --- | ----------- | ----- | --------------------- |
+|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
+|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
+|[1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
+|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |
+
+## Heads up
+Like everything else in Deep Learning, this repo is quickly getting old. Many other open-source repositories or SaaS apps (often paid) will give you better audio quality than this repository will. If you care about the fidelity and expressivity of the voice you're cloning, here are some personal recommendations for alternative voice cloning solutions:
+- Check out [CoquiTTS](https://github.com/coqui-ai/tts) for an open-source repository that is more up to date, with better voice cloning quality and more features.
+- Check out [paperswithcode](https://paperswithcode.com/task/speech-synthesis/) for other repositories and recent research in the field of speech synthesis.
+- Check out [Resemble.ai](https://www.resemble.ai/) (disclaimer: I work there) for state-of-the-art voice cloning with little hassle.
+
+## Setup
+
+### 1. Install Requirements
+1. Both Windows and Linux are supported. A GPU is recommended for training and for inference speed, but is not mandatory.
+2. Python 3.7 is recommended. Python 3.5 or greater should work, but you'll probably have to tweak the dependencies' versions. I recommend setting up a virtual environment using `venv`, but this is optional.
+3. Install [ffmpeg](https://ffmpeg.org/download.html#get-packages). This is necessary for reading audio files.
+4. Install [PyTorch](https://pytorch.org/get-started/locally/). Pick the latest stable version, your operating system, your package manager (pip by default), and finally pick any of the proposed CUDA versions if you have a GPU, otherwise pick CPU. Run the given command.
+5. Install the remaining requirements with `pip install -r requirements.txt`.
+
+### 2. (Optional) Download Pretrained Models
+Pretrained models are now downloaded automatically. If this doesn't work for you, you can manually download them [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models).
+
+### 3. (Optional) Test Configuration
+Before you download any dataset, you can begin by testing your configuration with:
+
+`python demo_cli.py`
+
+If all tests pass, you're good to go.
+
+### 4. (Optional) Download Datasets
+For playing with the toolbox alone, I only recommend downloading [`LibriSpeech/train-clean-100`](https://www.openslr.org/resources/12/train-clean-100.tar.gz). Extract the contents as `<datasets_root>/LibriSpeech/train-clean-100`, where `<datasets_root>` is a directory of your choosing. Other datasets are supported in the toolbox; see [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training#datasets). You're free not to download any dataset, but then you will need your own data as audio files, or you will have to record it with the toolbox.
+
+### 5. Launch the Toolbox
+You can then try the toolbox:
+
+`python demo_toolbox.py -d <datasets_root>`
+or
+`python demo_toolbox.py`
+
+depending on whether you downloaded any datasets. If you are running an X-server or if you have the error `Aborted (core dumped)`, see [this issue](https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/11#issuecomment-504733590).
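The three stages described in the README above map one-to-one onto the interfaces used by `demo_cli.py`, added later in this commit. A minimal sketch of the flow, assuming the pretrained models sit in `saved_models/default/` (the defaults used throughout this commit) and that `reference.wav` stands in for any short clip of the voice to clone:

```python
import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Stage 1: speaker encoder, stage 2: Tacotron synthesizer, stage 3: WaveRNN vocoder
encoder.load_model("saved_models/default/encoder.pt")
synthesizer = Synthesizer("saved_models/default/synthesizer.pt")
vocoder.load_model("saved_models/default/vocoder.pt")

wav = encoder.preprocess_wav("reference.wav")                          # load + trim the reference clip
embed = encoder.embed_utterance(wav)                                   # voice -> fixed-size embedding
specs = synthesizer.synthesize_spectrograms(["Hello world"], [embed])  # text + embedding -> mel spectrogram
generated_wav = vocoder.infer_waveform(specs[0])                       # mel spectrogram -> waveform
sf.write("cloned.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```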
T0055G0013S0005.wav ADDED
Binary file (121 kB).
 
app copy.py ADDED
@@ -0,0 +1,25 @@
+import numpy as np
+import gradio as gr
+
+notes = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
+
+def generate_tone(note, octave, duration):
+    sr = 48000
+    a4_freq, tones_from_a4 = 440, 12 * (octave - 4) + (note - 9)
+    frequency = a4_freq * 2 ** (tones_from_a4 / 12)
+    duration = int(duration)
+    audio = np.linspace(0, duration, duration * sr)
+    audio = (20000 * np.sin(audio * (2 * np.pi * frequency))).astype(np.int16)
+    return sr, audio
+
+demo = gr.Interface(
+    generate_tone,
+    [
+        gr.Dropdown(notes, type="index"),
+        gr.Slider(4, 6, step=1),
+        gr.Textbox(value=1, label="Duration in seconds"),
+    ],
+    "audio",
+)
+if __name__ == "__main__":
+    demo.launch()
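As a quick sanity check of the pitch math in `generate_tone` above: the dropdown passes the index of the selected note, "A" has index 9, so with `octave=4` the offset from A4 is `12 * (4 - 4) + (9 - 9) = 0` semitones and the synthesized tone is exactly 440 Hz; one second of audio is 48,000 samples at the hard-coded sample rate. A hypothetical check (not part of the committed app), assuming `generate_tone` and `notes` are in scope:

```python
note, octave = notes.index("A"), 4               # -> 9, 4
tones_from_a4 = 12 * (octave - 4) + (note - 9)   # 0 semitones away from A4
assert 440 * 2 ** (tones_from_a4 / 12) == 440.0  # A4 stays at 440 Hz

sr, audio = generate_tone(note, octave, duration="1")
assert sr == 48000 and audio.shape == (48000,)   # one second of int16 samples
```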
demo_cli.py ADDED
@@ -0,0 +1,208 @@
+import argparse
+import os
+from pathlib import Path
+
+import librosa
+import numpy as np
+import soundfile as sf
+import torch
+
+from encoder import inference as encoder
+from encoder.params_model import model_embedding_size as speaker_embedding_size
+from synthesizer.inference import Synthesizer
+from utils.argutils import print_args
+from utils.default_models import ensure_default_models
+from vocoder import inference as vocoder
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("-e", "--enc_model_fpath", type=Path,
+                        default="saved_models/default/encoder.pt",
+                        help="Path to a saved encoder")
+    parser.add_argument("-s", "--syn_model_fpath", type=Path,
+                        default="saved_models/default/synthesizer.pt",
+                        help="Path to a saved synthesizer")
+    parser.add_argument("-v", "--voc_model_fpath", type=Path,
+                        default="saved_models/default/vocoder.pt",
+                        help="Path to a saved vocoder")
+    parser.add_argument("--cpu", action="store_true", help=\
+        "If True, processing is done on CPU, even when a GPU is available.")
+    parser.add_argument("--no_sound", action="store_true", help=\
+        "If True, audio won't be played.")
+    parser.add_argument("--seed", type=int, default=None, help=\
+        "Optional random number seed value to make toolbox deterministic.")
+    args = parser.parse_args()
+    arg_dict = vars(args)
+    print_args(args, parser)
+
+    # Hide GPUs from Pytorch to force CPU processing
+    if arg_dict.pop("cpu"):
+        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
+
+    print("Running a test of your configuration...\n")
+
+    if torch.cuda.is_available():
+        device_id = torch.cuda.current_device()
+        gpu_properties = torch.cuda.get_device_properties(device_id)
+        ## Print some environment information (for debugging purposes)
+        print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
+              "%.1fGb total memory.\n" %
+              (torch.cuda.device_count(),
+               device_id,
+               gpu_properties.name,
+               gpu_properties.major,
+               gpu_properties.minor,
+               gpu_properties.total_memory / 1e9))
+    else:
+        print("Using CPU for inference.\n")
+
+    ## Load the models one by one.
+    print("Preparing the encoder, the synthesizer and the vocoder...")
+    ensure_default_models(Path("saved_models"))
+    encoder.load_model(args.enc_model_fpath)
+    synthesizer = Synthesizer(args.syn_model_fpath)
+    vocoder.load_model(args.voc_model_fpath)
+
+
+    ## Run a test
+    print("Testing your configuration with small inputs.")
+    # Forward an audio waveform of zeroes that lasts 1 second. Notice how we can get the encoder's
+    # sampling rate, which may differ.
+    # If you're unfamiliar with digital audio, know that it is encoded as an array of floats
+    # (or sometimes integers, but mostly floats in this project) ranging from -1 to 1.
+    # The sampling rate is the number of values (samples) recorded per second; it is set to
+    # 16000 for the encoder. Creating an array of length <sampling_rate> will always correspond
+    # to an audio of 1 second.
+    print("\tTesting the encoder...")
+    encoder.embed_utterance(np.zeros(encoder.sampling_rate))
+
+    # Create a dummy embedding. You would normally use the embedding that encoder.embed_utterance
+    # returns, but here we're going to make one ourselves just for the sake of showing that it's
+    # possible.
+    embed = np.random.rand(speaker_embedding_size)
+    # Embeddings are L2-normalized (this isn't important here, but if you want to make your own
+    # embeddings it will be).
+    embed /= np.linalg.norm(embed)
+    # The synthesizer can handle multiple inputs with batching. Let's create another embedding to
+    # illustrate that.
+    embeds = [embed, np.zeros(speaker_embedding_size)]
+    texts = ["test 1", "test 2"]
+    print("\tTesting the synthesizer... (loading the model will output a lot of text)")
+    mels = synthesizer.synthesize_spectrograms(texts, embeds)
+
+    # The vocoder synthesizes one waveform at a time, but it's more efficient for long ones. We
+    # can concatenate the mel spectrograms into a single one.
+    mel = np.concatenate(mels, axis=1)
+    # The vocoder can take a callback function to display the generation. More on that later. For
+    # now we'll simply hide it like this:
+    no_action = lambda *args: None
+    print("\tTesting the vocoder...")
+    # For the sake of making this test short, we'll pass a short target length. The target length
+    # is the length of the wav segments that are processed in parallel. E.g. for audio sampled
+    # at 16000 Hertz, a target length of 8000 means that the target audio will be cut in chunks of
+    # 0.5 seconds which will all be generated together. The parameters here are absurdly short, and
+    # that has a detrimental effect on the quality of the audio. The default parameters are
+    # recommended in general.
+    vocoder.infer_waveform(mel, target=200, overlap=50, progress_callback=no_action)
+
+    print("All tests passed! You can now synthesize speech.\n\n")
+
+
+    ## Interactive speech generation
+    print("This is a GUI-less example of an interface to SV2TTS. The purpose of this script is to "
+          "show how you can interface this project easily with your own. See the source code for "
+          "an explanation of what is happening.\n")
+
+    print("Interactive generation loop")
+    num_generated = 0
+    while True:
+        try:
+            # Get the reference audio filepath
+            message = "Reference voice: enter an audio filepath of a voice to be cloned (mp3, " \
+                      "wav, m4a, flac, ...):\n"
+            in_fpath = Path(input(message).replace("\"", "").replace("\'", ""))
+
+            ## Computing the embedding
+            # First, we load the wav using the function that the speaker encoder provides. This is
+            # important: there is preprocessing that must be applied.
+
+            # The following two methods are equivalent:
+            # - Directly load from the filepath:
+            preprocessed_wav = encoder.preprocess_wav(in_fpath)
+            # - If the wav is already loaded:
+            original_wav, sampling_rate = librosa.load(str(in_fpath))
+            preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
+            print("Loaded file successfully")
+
+            # Then we derive the embedding. There are many functions and parameters that the
+            # speaker encoder interfaces. These are mostly for in-depth research. You will typically
+            # only use this function (with its default parameters):
+            embed = encoder.embed_utterance(preprocessed_wav)
+            print("Created the embedding")
+
+
+            ## Generating the spectrogram
+            text = input("Write a sentence (+-20 words) to be synthesized:\n")
+
+            # If seed is specified, reset torch seed and force synthesizer reload
+            if args.seed is not None:
+                torch.manual_seed(args.seed)
+                synthesizer = Synthesizer(args.syn_model_fpath)
+
+            # The synthesizer works in batch, so you need to put your data in a list or numpy array
+            texts = [text]
+            embeds = [embed]
+            # If you know what the attention layer alignments are, you can retrieve them here by
+            # passing return_alignments=True
+            specs = synthesizer.synthesize_spectrograms(texts, embeds)
+            spec = specs[0]
+            print("Created the mel spectrogram")
+
+
+            ## Generating the waveform
+            print("Synthesizing the waveform:")
+
+            # If seed is specified, reset torch seed and reload vocoder
+            if args.seed is not None:
+                torch.manual_seed(args.seed)
+                vocoder.load_model(args.voc_model_fpath)
+
+            # Synthesizing the waveform is fairly straightforward. Remember that the longer the
+            # spectrogram, the more time-efficient the vocoder.
+            generated_wav = vocoder.infer_waveform(spec)
+
+
+            ## Post-generation
+            # There's a bug with sounddevice that makes the audio cut one second earlier, so we
+            # pad it.
+            generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
+
+            # Trim excess silences to compensate for gaps in spectrograms (issue #53)
+            generated_wav = encoder.preprocess_wav(generated_wav)
+
+            # Play the audio (non-blocking)
+            if not args.no_sound:
+                import sounddevice as sd
+                try:
+                    sd.stop()
+                    sd.play(generated_wav, synthesizer.sample_rate)
+                except sd.PortAudioError as e:
+                    print("\nCaught exception: %s" % repr(e))
+                    print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
+                except:
+                    raise
+
+            # Save it on the disk
+            filename = "demo_output_%02d.wav" % num_generated
+            print(generated_wav.dtype)
+            sf.write(filename, generated_wav.astype(np.float32), synthesizer.sample_rate)
+            num_generated += 1
+            print("\nSaved output as %s\n\n" % filename)
+
+
+        except Exception as e:
+            print("Caught exception: %s" % repr(e))
+            print("Restarting\n")
demo_output_01.wav ADDED
Binary file (189 kB).
 
demo_toolbox.py ADDED
@@ -0,0 +1,37 @@
+import argparse
+import os
+from pathlib import Path
+
+from toolbox import Toolbox
+from utils.argutils import print_args
+from utils.default_models import ensure_default_models
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description="Runs the toolbox.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+
+    parser.add_argument("-d", "--datasets_root", type=Path, help= \
+        "Path to the directory containing your datasets. See toolbox/__init__.py for a list of "
+        "supported datasets.", default=None)
+    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models",
+                        help="Directory containing all saved models")
+    parser.add_argument("--cpu", action="store_true", help=\
+        "If True, all inference will be done on CPU")
+    parser.add_argument("--seed", type=int, default=None, help=\
+        "Optional random number seed value to make toolbox deterministic.")
+    args = parser.parse_args()
+    arg_dict = vars(args)
+    print_args(args, parser)
+
+    # Hide GPUs from Pytorch to force CPU processing
+    if arg_dict.pop("cpu"):
+        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
+
+    # Remind the user to download pretrained models if needed
+    ensure_default_models(args.models_dir)
+
+    # Launch the toolbox
+    Toolbox(**arg_dict)
encoder_preprocess.py ADDED
@@ -0,0 +1,71 @@
+from encoder.preprocess import preprocess_librispeech, preprocess_voxceleb1, preprocess_voxceleb2
+from utils.argutils import print_args
+from pathlib import Path
+import argparse
+
+
+if __name__ == "__main__":
+    class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):
+        pass
+
+    parser = argparse.ArgumentParser(
+        description="Preprocesses audio files from datasets, encodes them as mel spectrograms and "
+                    "writes them to the disk. This will allow you to train the encoder. The "
+                    "datasets required are at least one of VoxCeleb1, VoxCeleb2 and LibriSpeech. "
+                    "Ideally, you should have all three. You should extract them as they are "
+                    "after having downloaded them and put them in the same directory, e.g.:\n"
+                    "-[datasets_root]\n"
+                    "  -LibriSpeech\n"
+                    "    -train-other-500\n"
+                    "  -VoxCeleb1\n"
+                    "    -wav\n"
+                    "    -vox1_meta.csv\n"
+                    "  -VoxCeleb2\n"
+                    "    -dev",
+        formatter_class=MyFormatter
+    )
+    parser.add_argument("datasets_root", type=Path, help=\
+        "Path to the directory containing your LibriSpeech/TTS and VoxCeleb datasets.")
+    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
+        "Path to the output directory that will contain the mel spectrograms. If left out, "
+        "defaults to <datasets_root>/SV2TTS/encoder/")
+    parser.add_argument("-d", "--datasets", type=str,
+                        default="librispeech_other,voxceleb1,voxceleb2", help=\
+        "Comma-separated list of the names of the datasets you want to preprocess. Only the train "
+        "set of these datasets will be used. Possible names: librispeech_other, voxceleb1, "
+        "voxceleb2.")
+    parser.add_argument("-s", "--skip_existing", action="store_true", help=\
+        "Whether to skip existing output files with the same name. Useful if this script was "
+        "interrupted.")
+    parser.add_argument("--no_trim", action="store_true", help=\
+        "Preprocess audio without trimming silences (not recommended).")
+    args = parser.parse_args()
+
+    # Verify webrtcvad is available
+    if not args.no_trim:
+        try:
+            import webrtcvad
+        except:
+            raise ModuleNotFoundError("Package 'webrtcvad' not found. This package enables "
+                "noise removal and is recommended. Please install and try again. If installation fails, "
+                "use --no_trim to disable this error message.")
+    del args.no_trim
+
+    # Process the arguments
+    args.datasets = args.datasets.split(",")
+    if not hasattr(args, "out_dir"):
+        args.out_dir = args.datasets_root.joinpath("SV2TTS", "encoder")
+    assert args.datasets_root.exists()
+    args.out_dir.mkdir(exist_ok=True, parents=True)
+
+    # Preprocess the datasets
+    print_args(args, parser)
+    preprocess_func = {
+        "librispeech_other": preprocess_librispeech,
+        "voxceleb1": preprocess_voxceleb1,
+        "voxceleb2": preprocess_voxceleb2,
+    }
+    args = vars(args)
+    for dataset in args.pop("datasets"):
+        print("Preprocessing %s" % dataset)
+        preprocess_func[dataset](**args)
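For reference, a typical invocation of the script above, assuming the datasets are laid out under `<datasets_root>` as described in the help text (LibriSpeech only in this example):

`python encoder_preprocess.py <datasets_root> --datasets librispeech_other`

The resulting mel spectrograms are written to `<datasets_root>/SV2TTS/encoder/` unless `-o`/`--out_dir` is given.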
encoder_train.py ADDED
@@ -0,0 +1,44 @@
+from utils.argutils import print_args
+from encoder.train import train
+from pathlib import Path
+import argparse
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Trains the speaker encoder. You must have run encoder_preprocess.py first.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+
+    parser.add_argument("run_id", type=str, help= \
+        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
+        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
+        "states and restart from scratch.")
+    parser.add_argument("clean_data_root", type=Path, help= \
+        "Path to the output directory of encoder_preprocess.py. If you left the default "
+        "output directory when preprocessing, it should be <datasets_root>/SV2TTS/encoder/.")
+    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
+        "Path to the root directory that contains all models. A directory <run_id> will be created under this root. "
+        "It will contain the saved model weights, as well as backups of those weights and plots generated during "
+        "training.")
+    parser.add_argument("-v", "--vis_every", type=int, default=10, help= \
+        "Number of steps between updates of the loss and the plots.")
+    parser.add_argument("-u", "--umap_every", type=int, default=100, help= \
+        "Number of steps between updates of the umap projection. Set to 0 to never update the "
+        "projections.")
+    parser.add_argument("-s", "--save_every", type=int, default=500, help= \
+        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
+        "model.")
+    parser.add_argument("-b", "--backup_every", type=int, default=7500, help= \
+        "Number of steps between backups of the model. Set to 0 to never make backups of the "
+        "model.")
+    parser.add_argument("-f", "--force_restart", action="store_true", help= \
+        "Do not load any saved model.")
+    parser.add_argument("--visdom_server", type=str, default="http://localhost")
+    parser.add_argument("--no_visdom", action="store_true", help= \
+        "Disable visdom.")
+    args = parser.parse_args()
+
+    # Run the training
+    print_args(args, parser)
+    train(**vars(args))
requirements.txt ADDED
Binary file (562 Bytes).
 
synthesizer_preprocess_audio.py ADDED
@@ -0,0 +1,47 @@
+from synthesizer.preprocess import preprocess_dataset
+from synthesizer.hparams import hparams
+from utils.argutils import print_args
+from pathlib import Path
+import argparse
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Preprocesses audio files from datasets, encodes them as mel spectrograms "
+                    "and writes them to the disk. Audio files are also saved, to be used by the "
+                    "vocoder for training.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("datasets_root", type=Path, help=\
+        "Path to the directory containing your LibriSpeech/TTS datasets.")
+    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help=\
+        "Path to the output directory that will contain the mel spectrograms, the audios and the "
+        "embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/")
+    parser.add_argument("-n", "--n_processes", type=int, default=4, help=\
+        "Number of processes in parallel.")
+    parser.add_argument("-s", "--skip_existing", action="store_true", help=\
+        "Whether to skip existing output files with the same name. Useful if the preprocessing was "
+        "interrupted.")
+    parser.add_argument("--hparams", type=str, default="", help=\
+        "Hyperparameter overrides as a comma-separated list of name=value pairs")
+    parser.add_argument("--no_alignments", action="store_true", help=\
+        "Use this option when the dataset does not include alignments "
+        "(these are used to split long audio files into sub-utterances).")
+    parser.add_argument("--datasets_name", type=str, default="LibriSpeech", help=\
+        "Name of the dataset directory to process.")
+    parser.add_argument("--subfolders", type=str, default="train-clean-100,train-clean-360", help=\
+        "Comma-separated list of subfolders to process inside your dataset directory")
+    args = parser.parse_args()
+
+    # Process the arguments
+    if not hasattr(args, "out_dir"):
+        args.out_dir = args.datasets_root.joinpath("SV2TTS", "synthesizer")
+
+    # Create directories
+    assert args.datasets_root.exists()
+    args.out_dir.mkdir(exist_ok=True, parents=True)
+
+    # Preprocess the dataset
+    print_args(args, parser)
+    args.hparams = hparams.parse(args.hparams)
+    preprocess_dataset(**vars(args))
synthesizer_preprocess_embeds.py ADDED
@@ -0,0 +1,25 @@
+from synthesizer.preprocess import create_embeddings
+from utils.argutils import print_args
+from pathlib import Path
+import argparse
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Creates embeddings for the synthesizer from the LibriSpeech utterances.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+    parser.add_argument("synthesizer_root", type=Path, help=\
+        "Path to the synthesizer training data that contains the audios and the train.txt file. "
+        "If you left everything as default, it should be <datasets_root>/SV2TTS/synthesizer/.")
+    parser.add_argument("-e", "--encoder_model_fpath", type=Path,
+                        default="saved_models/default/encoder.pt", help=\
+        "Path to your trained encoder model.")
+    parser.add_argument("-n", "--n_processes", type=int, default=4, help= \
+        "Number of parallel processes. An encoder is created for each, so you may need to lower "
+        "this value on GPUs with low memory. Set it to 1 if CUDA is unhappy.")
+    args = parser.parse_args()
+
+    # Preprocess the dataset
+    print_args(args, parser)
+    create_embeddings(**vars(args))
synthesizer_train.py ADDED
@@ -0,0 +1,36 @@
+from pathlib import Path
+
+from synthesizer.hparams import hparams
+from synthesizer.train import train
+from utils.argutils import print_args
+import argparse
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("run_id", type=str, help= \
+        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
+        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
+        "states and restart from scratch.")
+    parser.add_argument("syn_dir", type=Path, help= \
+        "Path to the synthesizer directory that contains the ground truth mel spectrograms, "
+        "the wavs and the embeds.")
+    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
+        "Path to the output directory that will contain the saved model weights and the logs.")
+    parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
+        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
+        "model.")
+    parser.add_argument("-b", "--backup_every", type=int, default=25000, help= \
+        "Number of steps between backups of the model. Set to 0 to never make backups of the "
+        "model.")
+    parser.add_argument("-f", "--force_restart", action="store_true", help= \
+        "Do not load any saved model and restart from scratch.")
+    parser.add_argument("--hparams", default="", help=\
+        "Hyperparameter overrides as a comma-separated list of name=value pairs")
+    args = parser.parse_args()
+    print_args(args, parser)
+
+    args.hparams = hparams.parse(args.hparams)
+
+    # Run the training
+    train(**vars(args))
test copy.py ADDED
@@ -0,0 +1,18 @@
+from gradio_client import Client
+from pathlib import Path
+import subprocess
+client = Client("https://balacoon-voice-conversion-service.hf.space/")
+result = client.predict(
+    "/home/sjx/Common/vits/Real-Time-Voice-Cloning-master/demo_output_01.wav",
+    "/home/sjx/Common/vits/Real-Time-Voice-Cloning-master/demo_output_01.wav",
+    "/home/sjx/Common/vits/Real-Time-Voice-Cloning-master/1.wav",
+    fn_index=1
+)
+source_path = Path(result)
+target_path = Path("./output/")
+mv_command = ['mv', source_path, target_path]
+try:
+    subprocess.run(mv_command, check=True)
+    print('Finished: moved %s to %s' % (source_path, target_path))
+except subprocess.CalledProcessError as e:
+    print('Error while moving the result: %s' % e)
vocoder_preprocess.py ADDED
@@ -0,0 +1,48 @@
+import argparse
+import os
+from pathlib import Path
+
+from synthesizer.hparams import hparams
+from synthesizer.synthesize import run_synthesis
+from utils.argutils import print_args
+
+
+
+if __name__ == "__main__":
+    class MyFormatter(argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter):
+        pass
+
+    parser = argparse.ArgumentParser(
+        description="Creates ground-truth aligned (GTA) spectrograms for training the vocoder.",
+        formatter_class=MyFormatter
+    )
+    parser.add_argument("datasets_root", type=Path, help=\
+        "Path to the directory containing your SV2TTS directory. If you specify both --in_dir and "
+        "--out_dir, this argument won't be used.")
+    parser.add_argument("-s", "--syn_model_fpath", type=Path,
+                        default="saved_models/default/synthesizer.pt",
+                        help="Path to a saved synthesizer")
+    parser.add_argument("-i", "--in_dir", type=Path, default=argparse.SUPPRESS, help= \
+        "Path to the synthesizer directory that contains the mel spectrograms, the wavs and the "
+        "embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
+    parser.add_argument("-o", "--out_dir", type=Path, default=argparse.SUPPRESS, help= \
+        "Path to the output vocoder directory that will contain the ground truth aligned mel "
+        "spectrograms. Defaults to <datasets_root>/SV2TTS/vocoder/.")
+    parser.add_argument("--hparams", default="", help=\
+        "Hyperparameter overrides as a comma-separated list of name=value pairs")
+    parser.add_argument("--cpu", action="store_true", help=\
+        "If True, processing is done on CPU, even when a GPU is available.")
+    args = parser.parse_args()
+    print_args(args, parser)
+    modified_hp = hparams.parse(args.hparams)
+
+    if not hasattr(args, "in_dir"):
+        args.in_dir = args.datasets_root / "SV2TTS" / "synthesizer"
+    if not hasattr(args, "out_dir"):
+        args.out_dir = args.datasets_root / "SV2TTS" / "vocoder"
+
+    if args.cpu:
+        # Hide GPUs from Pytorch to force CPU processing
+        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
+
+    run_synthesis(args.in_dir, args.out_dir, args.syn_model_fpath, modified_hp)
vocoder_train.py ADDED
@@ -0,0 +1,53 @@
+import argparse
+from pathlib import Path
+
+from utils.argutils import print_args
+from vocoder.train import train
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Trains the vocoder from the synthesizer audios and the GTA synthesized mels, "
+                    "or ground truth mels.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    )
+
+    parser.add_argument("run_id", type=str, help= \
+        "Name for this model. By default, training outputs will be stored to saved_models/<run_id>/. If a model state "
+        "from the same run ID was previously saved, the training will restart from there. Pass -f to overwrite saved "
+        "states and restart from scratch.")
+    parser.add_argument("datasets_root", type=Path, help= \
+        "Path to the directory containing your SV2TTS directory. Specifying --syn_dir or --voc_dir "
+        "will take priority over this argument.")
+    parser.add_argument("--syn_dir", type=Path, default=argparse.SUPPRESS, help= \
+        "Path to the synthesizer directory that contains the ground truth mel spectrograms, "
+        "the wavs and the embeds. Defaults to <datasets_root>/SV2TTS/synthesizer/.")
+    parser.add_argument("--voc_dir", type=Path, default=argparse.SUPPRESS, help= \
+        "Path to the vocoder directory that contains the GTA synthesized mel spectrograms. "
+        "Defaults to <datasets_root>/SV2TTS/vocoder/. Unused if --ground_truth is passed.")
+    parser.add_argument("-m", "--models_dir", type=Path, default="saved_models", help=\
+        "Path to the directory that will contain the saved model weights, as well as backups "
+        "of those weights and wavs generated during training.")
+    parser.add_argument("-g", "--ground_truth", action="store_true", help= \
+        "Train on ground truth spectrograms (<datasets_root>/SV2TTS/synthesizer/mels).")
+    parser.add_argument("-s", "--save_every", type=int, default=1000, help= \
+        "Number of steps between updates of the model on the disk. Set to 0 to never save the "
+        "model.")
+    parser.add_argument("-b", "--backup_every", type=int, default=25000, help= \
+        "Number of steps between backups of the model. Set to 0 to never make backups of the "
+        "model.")
+    parser.add_argument("-f", "--force_restart", action="store_true", help= \
+        "Do not load any saved model and restart from scratch.")
+    args = parser.parse_args()
+
+    # Process the arguments
+    if not hasattr(args, "syn_dir"):
+        args.syn_dir = args.datasets_root / "SV2TTS" / "synthesizer"
+    if not hasattr(args, "voc_dir"):
+        args.voc_dir = args.datasets_root / "SV2TTS" / "vocoder"
+    del args.datasets_root
+    args.models_dir.mkdir(exist_ok=True)
+
+    # Run the training
+    print_args(args, parser)
+    train(**vars(args))
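Taken together, the scripts added in this commit cover the full SV2TTS training pipeline. One plausible order of execution, inferred from each script's positional arguments and default paths (the commit itself does not prescribe an order):

`python encoder_preprocess.py <datasets_root>`
`python encoder_train.py <run_id> <datasets_root>/SV2TTS/encoder`
`python synthesizer_preprocess_audio.py <datasets_root>`
`python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer`
`python synthesizer_train.py <run_id> <datasets_root>/SV2TTS/synthesizer`
`python vocoder_preprocess.py <datasets_root>`
`python vocoder_train.py <run_id> <datasets_root>`

Each command's optional flags (`-o`, `-s`, `--hparams`, etc.) are documented in the corresponding file above.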