joebidenasadj committed
Commit 4ba17de · verified · 1 Parent(s): 3de1764

Upload 8 files

Inference_finetuned_xttsV2.ipynb ADDED
@@ -0,0 +1,239 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "82eada20-760d-451b-bdea-e771ca5f6fe2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/user/Dokumente/#master/TU_BERLIN/###___MASTER/Sem_03/pylda_env/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n",
+ "2025-01-19 20:59:19.199396: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.\n",
+ "2025-01-19 20:59:19.216174: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.\n",
+ "2025-01-19 20:59:19.229745: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
+ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
+ "E0000 00:00:1737316759.245180 14721 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
+ "E0000 00:00:1737316759.249655 14721 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
+ "2025-01-19 20:59:19.267759: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
+ "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading model...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/user/Dokumente/#master/TU_BERLIN/###___MASTER/Sem_03/pylda_env/lib/python3.11/site-packages/TTS/utils/io.py:54: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
+ " return torch.load(f, map_location=map_location, **kwargs)\n",
+ "GPT2InferenceModel has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.\n",
+ " - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes\n",
+ " - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).\n",
+ " - If you are not the owner of the model architecture class, please contact the model code owner to update it.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Generating speech for text 1...\n",
+ "Text: It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/user/Dokumente/#master/TU_BERLIN/###___MASTER/Sem_03/pylda_env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:695: UserWarning: `num_beams` is set to 1. However, `length_penalty` is set to `0.8` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `length_penalty`.\n",
+ " warnings.warn(\n",
+ "The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Generated audio saved to: generated_audio/generated_speech_0.wav\n",
+ "\n",
+ "Generating speech for text 2...\n",
+ "Text: Somebody once told me the world is gonna roll me, I ain't the sharpest tool in the shed. She was looking kind of dumb with her finger and her thumb, In the shape of an L on her forehead\n",
+ "Generated audio saved to: generated_audio/generated_speech_1.wav\n",
+ "\n",
+ "Generating speech for text 3...\n",
+ "Text: They’re taking the hobbits to Isengard!\n",
+ "Generated audio saved to: generated_audio/generated_speech_2.wav\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "import torch\n",
+ "import torchaudio\n",
+ "from TTS.tts.configs.xtts_config import XttsConfig\n",
+ "from TTS.tts.models.xtts import Xtts\n",
+ "from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, XttsAudioConfig\n",
+ "\n",
+ "# File links for required components\n",
+ "TOKENIZER_FILE_LINK = \"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json\"\n",
+ "MEL_NORM_FILE_LINK = \"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth\"\n",
+ "DVAE_CHECKPOINT_LINK = \"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth\"\n",
+ "\n",
+ "def create_model_config():\n",
+ " \"\"\"Create the model configuration matching your training setup.\"\"\"\n",
+ " # Initialize configurations\n",
+ " audio_config = XttsAudioConfig(\n",
+ " sample_rate=22050,\n",
+ " dvae_sample_rate=22050,\n",
+ " output_sample_rate=24000\n",
+ " )\n",
+ " \n",
+ " model_args = GPTArgs(\n",
+ " max_conditioning_length=132300, # 6 secs\n",
+ " min_conditioning_length=66150, # 3 secs\n",
+ " debug_loading_failures=False,\n",
+ " max_wav_length=255995, # ~11.6 seconds\n",
+ " max_text_length=200,\n",
+ " mel_norm_file=\"model_files/mel_stats.pth\", # Update this path\n",
+ " dvae_checkpoint=\"model_files/dvae.pth\", # Update this path\n",
+ " tokenizer_file=\"model_files/vocab.json\", # Update this path\n",
+ " gpt_num_audio_tokens=1026,\n",
+ " gpt_start_audio_token=1024,\n",
+ " gpt_stop_audio_token=1025,\n",
+ " gpt_use_masking_gt_prompt_approach=True,\n",
+ " gpt_use_perceiver_resampler=True,\n",
+ " )\n",
+ " \n",
+ " config = XttsConfig(\n",
+ " model_args=model_args,\n",
+ " audio=audio_config,\n",
+ " # Add any other necessary configuration parameters\n",
+ " )\n",
+ " \n",
+ " return config\n",
+ "\n",
+ "def load_model(checkpoint_path):\n",
+ " \"\"\"Load the XTTS model from checkpoint.\"\"\"\n",
+ " config = create_model_config()\n",
+ " model = Xtts.init_from_config(config)\n",
+ " model.load_checkpoint(config, checkpoint_path)\n",
+ " model.eval()\n",
+ " \n",
+ " if torch.cuda.is_available():\n",
+ " model.cuda()\n",
+ " \n",
+ " return model, config\n",
+ "\n",
+ "def generate_speech(model, text, language, speaker_wav, config, output_path, n):  # note: 'n' is currently unused\n",
+ " \"\"\"Generate speech using the loaded model.\"\"\"\n",
+ " outputs = model.synthesize(\n",
+ " text=text,\n",
+ " language=language,\n",
+ " speaker_wav=speaker_wav,\n",
+ " config=config,\n",
+ " temperature=1.,\n",
+ " length_penalty=0.8,\n",
+ " repetition_penalty=2.0,\n",
+ " )\n",
+ "\n",
+ " # Save the generated audio\n",
+ " # Convert the output waveform (a list/array) to a PyTorch tensor\n",
+ " audio_tensor = torch.tensor(outputs['wav'])\n",
+ " \n",
+ " # Add a channel dimension: torchaudio expects (channels, samples), so mono is (1, N)\n",
+ " audio_tensor = audio_tensor.unsqueeze(0) # Shape: (1, N)\n",
+ " \n",
+ " # Define the sample rate (24,000 Hz, the XTTS v2 output rate)\n",
+ " sample_rate = 24000\n",
+ " \n",
+ " # Save the tensor as a .wav file\n",
+ " torchaudio.save(output_path, audio_tensor, sample_rate)\n",
+ " print(f\"Generated audio saved to: {output_path}\")\n",
+ "\n",
+ "def main():\n",
+ " # Set your paths\n",
+ " checkpoint_path = \"./\" # Your trained model checkpoint\n",
+ " speaker_wav = \"./speaker.wav\" # Reference audio for speaker characteristics\n",
+ " output_dir = \"generated_audio\"\n",
+ " \n",
+ " # Make sure these files exist and paths are correct\n",
+ " assert os.path.exists(\"model_files/mel_stats.pth\"), \"mel_stats.pth not found\"\n",
+ " assert os.path.exists(\"model_files/dvae.pth\"), \"dvae.pth not found\"\n",
+ " assert os.path.exists(\"model_files/vocab.json\"), \"vocab.json not found\"\n",
+ " \n",
+ " # Create output directory if it doesn't exist\n",
+ " os.makedirs(output_dir, exist_ok=True)\n",
+ " \n",
+ " # Load the model\n",
+ " print(\"Loading model...\")\n",
+ " model, config = load_model(checkpoint_path)\n",
+ " \n",
+ " # Example texts to generate\n",
+ " texts = [\n",
+ " \"It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.\",\n",
+ " \"Somebody once told me the world is gonna roll me, I ain't the sharpest tool in the shed. She was looking kind of dumb with her finger and her thumb, In the shape of an L on her forehead\",\n",
+ " \"They’re taking the hobbits to Isengard!\"\n",
+ " ]\n",
+ " \n",
+ " # Generate speech for each text\n",
+ " for i, text in enumerate(texts):\n",
+ " output_path = os.path.join(output_dir, f\"generated_speech_{i}.wav\")\n",
+ " print(f\"\\nGenerating speech for text {i+1}...\")\n",
+ " print(f\"Text: {text}\")\n",
+ " \n",
+ " generate_speech(\n",
+ " model=model,\n",
+ " text=text,\n",
+ " language=\"en\", # Change according to your needs\n",
+ " speaker_wav=speaker_wav,\n",
+ " config=config,\n",
+ " output_path=output_path,\n",
+ " n=i\n",
+ " )\n",
+ "\n",
+ "if __name__ == \"__main__\":\n",
+ " main()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "90160c61-7335-49b9-b71e-47d9e71ddf74",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python (pylda_env)",
+ "language": "python",
+ "name": "pylda_env"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
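
Note that the notebook defines TOKENIZER_FILE_LINK, MEL_NORM_FILE_LINK, and DVAE_CHECKPOINT_LINK but never downloads them; the asserts in main() expect the files to already sit under model_files/. A minimal sketch for fetching them first, assuming requests is installed and the Coqui gateway URLs from the notebook are still reachable:

# Sketch: fetch the auxiliary XTTS-v2 files the notebook expects under model_files/.
# The URLs are the *_LINK constants from the notebook; their availability is assumed.
import os
import requests

FILES = {
    "model_files/vocab.json": "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json",
    "model_files/mel_stats.pth": "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth",
    "model_files/dvae.pth": "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth",
}

os.makedirs("model_files", exist_ok=True)
for path, url in FILES.items():
    if not os.path.exists(path):  # skip files that are already in place
        resp = requests.get(url, timeout=120)
        resp.raise_for_status()
        with open(path, "wb") as f:
            f.write(resp.content)
        print(f"Downloaded {path}")
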
config.json ADDED
@@ -0,0 +1,219 @@
+ {
+ "output_path": "/home/pady/run/training",
+ "logger_uri": null,
+ "run_name": "GPT_XTTS_v2.0_LJSpeech_FT",
+ "project_name": "Agressive_TTS",
+ "run_description": "\n GPT XTTS training\n ",
+ "print_step": 500,
+ "plot_step": 500,
+ "model_param_stats": false,
+ "wandb_entity": null,
+ "dashboard_logger": "tensorboard",
+ "save_on_interrupt": true,
+ "log_model_step": 1000,
+ "save_step": 1000,
+ "save_n_checkpoints": 5,
+ "save_checkpoints": true,
+ "save_all_best": false,
+ "save_best_after": 0,
+ "target_loss": null,
+ "print_eval": false,
+ "test_delay_epochs": 0,
+ "run_eval": true,
+ "run_eval_steps": 100,
+ "distributed_backend": "nccl",
+ "distributed_url": "tcp://localhost:54321",
+ "mixed_precision": false,
+ "precision": "fp16",
+ "epochs": 10,
+ "batch_size": 4,
+ "eval_batch_size": 4,
+ "grad_clip": 0.0,
+ "scheduler_after_epoch": true,
+ "lr": 5e-06,
+ "optimizer": "AdamW",
+ "optimizer_params": {
+ "betas": [
+ 0.9,
+ 0.96
+ ],
+ "eps": 1e-08,
+ "weight_decay": 0.01
+ },
+ "lr_scheduler": "MultiStepLR",
+ "lr_scheduler_params": {
+ "milestones": [
+ 900000,
+ 2700000,
+ 5400000
+ ],
+ "gamma": 0.5,
+ "last_epoch": -1
+ },
+ "use_grad_scaler": false,
+ "allow_tf32": false,
+ "cudnn_enable": true,
+ "cudnn_deterministic": false,
+ "cudnn_benchmark": false,
+ "training_seed": 1,
+ "model": "xtts",
+ "num_loader_workers": 8,
+ "num_eval_loader_workers": 0,
+ "use_noise_augment": false,
+ "audio": {
+ "sample_rate": 22050,
+ "output_sample_rate": 24000,
+ "dvae_sample_rate": 22050
+ },
+ "use_phonemes": false,
+ "phonemizer": null,
+ "phoneme_language": null,
+ "compute_input_seq_cache": false,
+ "text_cleaner": null,
+ "enable_eos_bos_chars": false,
+ "test_sentences_file": "",
+ "phoneme_cache_path": null,
+ "characters": null,
+ "add_blank": false,
+ "batch_group_size": 48,
+ "loss_masking": null,
+ "min_audio_len": 1,
+ "max_audio_len": Infinity,
+ "min_text_len": 1,
+ "max_text_len": Infinity,
+ "compute_f0": false,
+ "compute_energy": false,
+ "compute_linear_spec": false,
+ "precompute_num_workers": 0,
+ "start_by_longest": false,
+ "shuffle": false,
+ "drop_last": false,
+ "datasets": [
+ {
+ "formatter": "",
+ "dataset_name": "",
+ "path": "",
+ "meta_file_train": "",
+ "ignored_speakers": null,
+ "language": "",
+ "phonemizer": "",
+ "meta_file_val": "",
+ "meta_file_attn_mask": ""
+ }
+ ],
+ "test_sentences": [
+ {
+ "text": "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
+ "speaker_wav": [
+ "/home/MotivationalSpeechSynthesis/motivational-speech-synthesis/data/unprocessed_motivational_speech/wavs/gZElFCO3E00/gZElFCO3E00_0040.wav"
+ ],
+ "language": "en"
+ },
+ {
+ "text": "AHH I am so angry!",
+ "speaker_wav": [
+ "/home/MotivationalSpeechSynthesis/motivational-speech-synthesis/data/unprocessed_motivational_speech/wavs/gZElFCO3E00/gZElFCO3E00_0040.wav"
+ ],
+ "language": "en"
+ },
+ {
+ "text": "Skibbidi toilet. Skibbidi, skibbidi toilet",
+ "speaker_wav": [
+ "/home/MotivationalSpeechSynthesis/motivational-speech-synthesis/data/unprocessed_motivational_speech/wavs/gZElFCO3E00/gZElFCO3E00_0040.wav"
+ ],
+ "language": "en"
+ },
+ {
+ "text": "You need to be a MAN! Be a man. Hustle, and some money",
+ "speaker_wav": [
+ "/home/MotivationalSpeechSynthesis/motivational-speech-synthesis/data/unprocessed_motivational_speech/wavs/gZElFCO3E00/gZElFCO3E00_0040.wav"
+ ],
+ "language": "en"
+ }
+ ],
+ "eval_split_max_size": 256,
+ "eval_split_size": 0.01,
+ "use_speaker_weighted_sampler": false,
+ "speaker_weighted_sampler_alpha": 1.0,
+ "use_language_weighted_sampler": false,
+ "language_weighted_sampler_alpha": 1.0,
+ "use_length_weighted_sampler": false,
+ "length_weighted_sampler_alpha": 1.0,
+ "model_args": {
+ "gpt_batch_size": 1,
+ "enable_redaction": false,
+ "kv_cache": true,
+ "gpt_checkpoint": "",
+ "clvp_checkpoint": null,
+ "decoder_checkpoint": null,
+ "num_chars": 255,
+ "tokenizer_file": "/home/pady/run/training/XTTS_v2.0_original_model_files/vocab.json",
+ "gpt_max_audio_tokens": 605,
+ "gpt_max_text_tokens": 402,
+ "gpt_max_prompt_tokens": 70,
+ "gpt_layers": 30,
+ "gpt_n_model_channels": 1024,
+ "gpt_n_heads": 16,
+ "gpt_number_text_tokens": 6681,
+ "gpt_start_text_token": 261,
+ "gpt_stop_text_token": 0,
+ "gpt_num_audio_tokens": 1026,
+ "gpt_start_audio_token": 1024,
+ "gpt_stop_audio_token": 1025,
+ "gpt_code_stride_len": 1024,
+ "gpt_use_masking_gt_prompt_approach": true,
+ "gpt_use_perceiver_resampler": true,
+ "input_sample_rate": 22050,
+ "output_sample_rate": 24000,
+ "output_hop_length": 256,
+ "decoder_input_dim": 1024,
+ "d_vector_dim": 512,
+ "cond_d_vector_in_each_upsampling_layer": true,
+ "duration_const": 102400,
+ "min_conditioning_length": 66150,
+ "max_conditioning_length": 132300,
+ "gpt_loss_text_ce_weight": 0.01,
+ "gpt_loss_mel_ce_weight": 1.0,
+ "debug_loading_failures": false,
+ "max_wav_length": 255995,
+ "max_text_length": 200,
+ "mel_norm_file": "/home/pady/run/training/XTTS_v2.0_original_model_files/mel_stats.pth",
+ "dvae_checkpoint": "/home/pady/run/training/XTTS_v2.0_original_model_files/dvae.pth",
+ "xtts_checkpoint": "/home/pady/run/training/XTTS_v2.0_original_model_files/model.pth",
+ "vocoder": ""
+ },
+ "model_dir": null,
+ "languages": [
+ "en",
+ "es",
+ "fr",
+ "de",
+ "it",
+ "pt",
+ "pl",
+ "tr",
+ "ru",
+ "nl",
+ "cs",
+ "ar",
+ "zh-cn",
+ "hu",
+ "ko",
+ "ja",
+ "hi"
+ ],
+ "temperature": 0.85,
+ "length_penalty": 1.0,
+ "repetition_penalty": 2.0,
+ "top_k": 50,
+ "top_p": 0.85,
+ "num_gpt_outputs": 1,
+ "gpt_cond_len": 12,
+ "gpt_cond_chunk_len": 4,
+ "max_ref_len": 10,
+ "sound_norm_refs": false,
+ "optimizer_wd_only_on_weights": false,
+ "weighted_loss_attrs": {},
+ "weighted_loss_multipliers": {},
+ "github_branch": "inside_docker"
+ }
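
The config above was dumped by the Coqui trainer; reloading it for inspection should work through the load_json helper that Coqui config classes inherit from coqpit, though whether every trainer-side field round-trips cleanly into XttsConfig is an assumption. A minimal sketch:

# Sketch: reload the dumped config and inspect a few inference defaults.
# Assumes the TTS (Coqui) package is installed; load_json comes from coqpit.
from TTS.tts.configs.xtts_config import XttsConfig

config = XttsConfig()
config.load_json("config.json")

print(config.model_args.gpt_layers)  # expected: 30
print(config.temperature)            # expected: 0.85
print(config.languages[:3])          # expected: ['en', 'es', 'fr']
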
model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3d4b1421061c0ec4dd90af3dee7051fccc0824da606991897b041a674b0ecbfd
+ size 5607927381
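
model.pth and the .pth files under model_files/ are stored as Git LFS pointers, so a plain git clone yields only the three-line stubs shown here; git lfs pull fetches the real payloads. Once downloaded, a file can be checked against the oid from its pointer using only the standard library:

# Sketch: verify an LFS-managed file against the sha256 oid in its pointer file.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB checkpoints don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "3d4b1421061c0ec4dd90af3dee7051fccc0824da606991897b041a674b0ecbfd"
assert sha256_of("model.pth") == expected, "model.pth does not match its LFS oid"
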
model_files/dvae.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b29bc227d410d4991e0a8c09b858f77415013eeb9fba9650258e96095557d97a
+ size 210514388
model_files/mel_stats.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f69422a8a8f344c4fca2f0c6b8d41d2151d6615b7321e48e6bb15ae949b119c
+ size 1067
model_files/vocab.json ADDED
The diff for this file is too large to render.
 
speaker.wav ADDED
Binary file (318 kB).
 
vocab.json ADDED
The diff for this file is too large to render.