{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Qpw04rkbynx0" }, "source": [ "To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n", "\n", "Join [Discord](https://discord.gg/unsloth) if you need help + ⭐ Star us on [Github](https://github.com/unslothai/unsloth) ⭐\n", "
\n", "\n", "To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).\n", "\n", "You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5fs-yYEaynx1" }, "source": [ "### News" ] }, { "cell_type": "markdown", "metadata": { "id": "pyJK0UZaynx2" }, "source": [ "Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).\n", "\n", "Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!\n", "\n", "Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "SDUHv0mwynx3" }, "source": [ "### Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "MY4G3EIbynx3" }, "outputs": [], "source": [ "%%capture\n", "import os\n", "if \"COLAB_\" not in \"\".join(os.environ.keys()):\n", "    %pip install unsloth\n", "else:\n", "    # Do this only in Colab notebooks! Otherwise use pip install unsloth\n", "    %pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo\n", "    %pip install sentencepiece protobuf \"datasets>=3.4.1,<4.0.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n", "    %pip install --no-deps unsloth\n", "# Shell command (not a line magic): clone Spark-TTS for its audio tokenizer code\n", "!git clone https://github.com/SparkAudio/Spark-TTS\n", "%pip install omegaconf einx" ] },
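{ "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, sanity-check the installation before continuing. This is a minimal sketch using the standard-library `importlib.metadata`; the exact package list shown is just an illustrative assumption." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: print the versions of the key packages we just installed\n", "from importlib.metadata import version\n", "\n", "for package in (\"unsloth\", \"transformers\", \"trl\", \"datasets\"):\n", "    print(package, version(package))" ] },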
{ "cell_type": "markdown", "metadata": { "id": "AkWYsztAs9Ky" }, "source": [ "### Unsloth\n", "\n", "`FastModel` supports loading nearly any model now! This includes Vision and Text models!\n", "\n", "Thank you to [Etherl](https://huggingface.co/Etherll) for creating this notebook!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2025-03-22T00:48:54.511089Z", "iopub.status.busy": "2025-03-22T00:48:54.510770Z", "iopub.status.idle": "2025-03-22T00:51:37.363415Z", "shell.execute_reply": "2025-03-22T00:51:37.362696Z", "shell.execute_reply.started": "2025-03-22T00:48:54.511053Z" }, "id": "QmUBVEnvCDJv", "outputId": "42083a68-d3cc-48c9-d852-b60796377434" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.\n", "🦥 Unsloth Zoo will now patch everything to make training faster!\n", "==((====))== Unsloth 2025.8.1: Fast Qwen2 patching. Transformers: 4.54.1.\n", " \\\\ /| Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.\n", "O^O/ \\_/ \\ Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0\n", "\\ / Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]\n", " \"-____-\" Free license: http://github.com/unslothai/unsloth\n", "Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!\n", "Unsloth: Float16 full finetuning uses more memory since we upcast weights to float32.\n" ] } ], "source": [ "from unsloth import FastModel\n", "import torch\n", "from huggingface_hub import snapshot_download\n", "\n", "max_seq_length = 2048 # Choose any for long context!\n", "\n", "fourbit_models = [\n", "    # 4bit dynamic quants for superior accuracy and low memory use\n", "    \"unsloth/gemma-3-4b-it-unsloth-bnb-4bit\",\n", "    \"unsloth/gemma-3-12b-it-unsloth-bnb-4bit\",\n", "    \"unsloth/gemma-3-27b-it-unsloth-bnb-4bit\",\n", "    # Qwen3 new models\n", "    \"unsloth/Qwen3-4B-unsloth-bnb-4bit\",\n", "    \"unsloth/Qwen3-8B-unsloth-bnb-4bit\",\n", "    # Other very popular models!\n", "    \"unsloth/Llama-3.1-8B\",\n", "    \"unsloth/Llama-3.2-3B\",\n", "    \"unsloth/Llama-3.3-70B\",\n", "    \"unsloth/mistral-7b-instruct-v0.3\",\n", "    \"unsloth/Phi-4\",\n", "] # More models at https://huggingface.co/unsloth\n", "\n", "# Download the Spark-TTS model weights and audio codec\n", "snapshot_download(\"unsloth/Spark-TTS-0.5B\", local_dir = \"Spark-TTS-0.5B\")\n", "\n", "model, tokenizer = FastModel.from_pretrained(\n", "    model_name = \"Spark-TTS-0.5B/LLM\",\n", "    max_seq_length = max_seq_length,\n", "    dtype = torch.float32, # Spark seems to only work on float32 for now\n", "    full_finetuning = True, # We support full finetuning now!\n", "    load_in_4bit = False,\n", "    #token = \"hf_...\", # use one if using gated models like meta-llama/Llama-2-7b-hf\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "SXd9bTZd1aaL" }, "source": [ "We now add LoRA adapters so we only need to update 1 to 10% of all parameters! Note that because we set `full_finetuning = True` above, this call is a no-op here and all parameters are trained." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2025-03-22T00:51:37.365079Z", "iopub.status.busy": "2025-03-22T00:51:37.364731Z", "iopub.status.idle": "2025-03-22T00:51:44.221612Z", "shell.execute_reply": "2025-03-22T00:51:44.220949Z", "shell.execute_reply.started": "2025-03-22T00:51:37.365045Z" }, "id": "6bZsfBuZDeCL", "outputId": "292447b8-fd80-4b8b-ba3f-4637a1045166" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unsloth: Full finetuning is enabled, so .get_peft_model has no effect\n" ] } ], "source": [ "# LoRA does not work with float32; it only works with bfloat16!\n", "model = FastModel.get_peft_model(\n", "    model,\n", "    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128\n", "    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n", "                      \"gate_proj\", \"up_proj\", \"down_proj\",],\n", "    lora_alpha = 128,\n", "    lora_dropout = 0, # Supports any, but = 0 is optimized\n", "    bias = \"none\", # Supports any, but = \"none\" is optimized\n", "    # [NEW] \"unsloth\" uses 30% less VRAM, fits 2x larger batch sizes!\n", "    use_gradient_checkpointing = \"unsloth\", # True or \"unsloth\" for very long context\n", "    random_state = 3407,\n", "    use_rslora = False, # We support rank stabilized LoRA\n", "    loftq_config = None, # And LoftQ\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "vITh0KVJ10qX" }, "source": [ "<a name=\"Data\"></a>\n", "### Data Prep\n", "\n", "We will use `Balaji-1904/TTS_KN_DS_V1.1`, a dataset for training TTS models. Ensure that your dataset follows the required format: **text, audio** for single-speaker models or **source, text, audio** for multi-speaker models. You can modify this section to accommodate your own dataset, but maintaining the correct structure is essential for optimal training." ] },
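{ "cell_type": "markdown", "metadata": {}, "source": [ "If you are building a dataset from your own recordings, here is a minimal sketch of the expected **text, audio** layout using the `datasets` library. The file paths are hypothetical placeholders; add a `source` column for multi-speaker data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: build a {text, audio} dataset from your own clips.\n", "# The file paths below are hypothetical placeholders.\n", "from datasets import Dataset, Audio\n", "\n", "my_dataset = Dataset.from_dict({\n", "    \"text\" : [\"Hello there!\", \"How are you today?\"],\n", "    \"audio\": [\"clips/sample1.wav\", \"clips/sample2.wav\"],\n", "})\n", "# Decode each path into {array, sampling_rate} on access\n", "my_dataset = my_dataset.cast_column(\"audio\", Audio(sampling_rate = 16000))" ] },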
{ "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2025-03-22T00:51:44.222880Z", "iopub.status.busy": "2025-03-22T00:51:44.222617Z", "iopub.status.idle": "2025-03-22T00:52:16.516878Z", "shell.execute_reply": "2025-03-22T00:52:16.516033Z", "shell.execute_reply.started": "2025-03-22T00:51:44.222848Z" }, "id": "LjY75GoYUCB8" }, "outputs": [], "source": [ "from datasets import load_dataset\n", "dataset = load_dataset(\"Balaji-1904/TTS_KN_DS_V1.1\", split = \"train\")" ] },
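{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's quickly inspect one row to confirm the **text, audio** structure described above; the `audio` column decodes to a dict with `array` and `sampling_rate`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Peek at one example to verify the expected columns\n", "row = dataset[0]\n", "print(row[\"text\"])\n", "print(\"sampling rate:\", row[\"audio\"][\"sampling_rate\"])\n", "print(\"num samples :\", len(row[\"audio\"][\"array\"]))" ] },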
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173, "referenced_widgets": [ "a3b0c0581f1f4c428baaadd8e9a39b6f", "2315228ff2b141afabe1263471f5364b", "0474debc340943bd85f3daf92aebf7aa", "cff1b0fa2ea24f45aab26685353eefdd", "b7e20be79df246f19b35114a690e44f0", "426eb100a94642f79e6b99777406a265", "a36b5cf197dd4bd9a7f70aa6671b804c", "0de4d0f282404edfbc191dca73f15f35", "e58b5ad2f781475d8af2ddb38009baa6", "33fbacbb2aa146cd90586357eec1dc3e", "930b4d1d5f4b494b830df4d4c398e67c" ] }, "execution": { "iopub.execute_input": "2025-03-22T00:52:16.518175Z", "iopub.status.busy": "2025-03-22T00:52:16.517841Z", "iopub.status.idle": "2025-03-22T00:52:35.039329Z", "shell.execute_reply": "2025-03-22T00:52:35.038356Z", "shell.execute_reply.started": "2025-03-22T00:52:16.518146Z" }, "id": "zK94B-Pfioto", "outputId": "3f11cf35-c173-410d-f709-43552323f26f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.11/dist-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.\n", " WeightNorm.apply(module, name, dim)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Missing tensor: mel_transformer.spectrogram.window\n", "Missing tensor: mel_transformer.mel_scale.fb\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Parameter 'function'= of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n", "WARNING:datasets.fingerprint:Parameter 'function'= of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a3b0c0581f1f4c428baaadd8e9a39b6f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/401 [00:00<?, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import sys\n", "sys.path.append(\"Spark-TTS\")\n", "\n", "import numpy as np\n", "import torch\n", "import torchaudio.transforms as T\n", "\n", "from sparktts.models.audio_tokenizer import BiCodecTokenizer\n", "from sparktts.utils.audio import audio_volume_normalize\n", "\n", "# Load the BiCodec audio tokenizer, which turns waveforms into global + semantic tokens\n", "audio_tokenizer = BiCodecTokenizer(\"Spark-TTS-0.5B\", \"cuda\")\n", "\n", "def extract_wav2vec2_features(wavs: torch.Tensor) -> torch.Tensor:\n", "    \"\"\"Extract wav2vec2 features\"\"\"\n", "    if wavs.shape[0] != 1:\n", "        raise ValueError(f\"Expected batch size 1, but got shape {wavs.shape}\")\n", "    wav_np = wavs.squeeze(0).cpu().numpy()\n", "\n", "    processed = audio_tokenizer.processor(\n", "        wav_np,\n", "        sampling_rate=16000,\n", "        return_tensors=\"pt\",\n", "        padding=True,\n", "    )\n", "    input_values = processed.input_values\n", "    input_values = input_values.to(audio_tokenizer.feature_extractor.device)\n", "\n", "    model_output = audio_tokenizer.feature_extractor(\n", "        input_values,\n", "    )\n", "\n", "    if model_output.hidden_states is None:\n", "        raise ValueError(\"Wav2Vec2Model did not return hidden states. Ensure config `output_hidden_states=True`.\")\n", "\n", "    num_layers = len(model_output.hidden_states)\n", "    required_layers = [11, 14, 16]\n", "    if any(l >= num_layers for l in required_layers):\n", "        raise IndexError(f\"Requested hidden state indices {required_layers} out of range for model with {num_layers} layers.\")\n", "\n", "    # Average three intermediate hidden states, as Spark-TTS expects\n", "    feats_mix = (\n", "        model_output.hidden_states[11] + model_output.hidden_states[14] + model_output.hidden_states[16]\n", "    ) / 3\n", "    return feats_mix\n", "\n", "def formatting_audio_func(example):\n", "    text = f\"{example['source']}: {example['text']}\" if \"source\" in example else example[\"text\"]\n", "    audio_array = example[\"audio\"][\"array\"]\n", "    sampling_rate = example[\"audio\"][\"sampling_rate\"]\n", "\n", "    target_sr = audio_tokenizer.config['sample_rate']\n", "    if sampling_rate != target_sr:\n", "        resampler = T.Resample(orig_freq=sampling_rate, new_freq=target_sr)\n", "        audio_tensor_temp = torch.from_numpy(audio_array).float()\n", "        audio_array = resampler(audio_tensor_temp).numpy()\n", "\n", "    if audio_tokenizer.config[\"volume_normalize\"]:\n", "        audio_array = audio_volume_normalize(audio_array)\n", "\n", "    ref_wav_np = audio_tokenizer.get_ref_clip(audio_array)\n", "\n", "    audio_tensor = torch.from_numpy(audio_array).unsqueeze(0).float().to(audio_tokenizer.device)\n", "    ref_wav_tensor = torch.from_numpy(ref_wav_np).unsqueeze(0).float().to(audio_tokenizer.device)\n", "\n", "    feat = extract_wav2vec2_features(audio_tensor)\n", "\n", "    batch = {\n", "        \"wav\": audio_tensor,\n", "        \"ref_wav\": ref_wav_tensor,\n", "        \"feat\": feat.to(audio_tokenizer.device),\n", "    }\n", "\n", "    semantic_token_ids, global_token_ids = audio_tokenizer.model.tokenize(batch)\n", "\n", "    global_tokens = \"\".join(\n", "        [f\"<|bicodec_global_{i}|>\" for i in global_token_ids.squeeze().cpu().numpy()] # Squeeze batch dim\n", "    )\n", "    semantic_tokens = \"\".join(\n", "        [f\"<|bicodec_semantic_{i}|>\" for i in semantic_token_ids.squeeze().cpu().numpy()] # Squeeze batch dim\n", "    )\n", "\n", "    inputs = [\n", "        \"<|task_tts|>\",\n", "        \"<|start_content|>\",\n", "        text,\n", "        \"<|end_content|>\",\n", "        \"<|start_global_token|>\",\n", "        global_tokens,\n", "        \"<|end_global_token|>\",\n", "        \"<|start_semantic_token|>\",\n", "        semantic_tokens,\n", "        \"<|end_semantic_token|>\",\n", "        \"<|im_end|>\"\n", "    ]\n", "    inputs = \"\".join(inputs)\n", "    return {\"text\": inputs}\n", "\n", "dataset = dataset.map(formatting_audio_func, remove_columns=[\"audio\"])\n", "print(\"Moving Bicodec model and Wav2Vec2Model to cpu.\")\n", "audio_tokenizer.model.cpu()\n", "audio_tokenizer.feature_extractor.cpu()\n", "torch.cuda.empty_cache()" ] },
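{ "cell_type": "markdown", "metadata": {}, "source": [ "After the map, each row's `text` is one long string in the Spark-TTS prompt format: `<|task_tts|>`, the transcript, then the global and semantic audio tokens. As a quick optional check, peek at the head and tail of one formatted example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect the start and end of one formatted training string\n", "sample = dataset[0][\"text\"]\n", "print(sample[:200])\n", "print(\"...\")\n", "print(sample[-100:])" ] },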
{ "cell_type": "markdown", "metadata": { "id": "idAEIeSQ3xdS" }, "source": [ "<a name=\"Train\"></a>\n", "### Train the model\n", "Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 5 full epochs here, but to speed things up you can set `max_steps = 60` and remove `num_train_epochs`. We also support TRL's `DPOTrainer`!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2025-03-22T00:34:09.688959Z", "iopub.status.busy": "2025-03-22T00:34:09.688649Z", "iopub.status.idle": "2025-03-22T00:34:09.729661Z", "shell.execute_reply": "2025-03-22T00:34:09.729001Z", "shell.execute_reply.started": "2025-03-22T00:34:09.688939Z" }, "id": "95_Nn-89DhsL" }, "outputs": [], "source": [ "from trl import SFTConfig, SFTTrainer\n", "trainer = SFTTrainer(\n", "    model = model,\n", "    tokenizer = tokenizer,\n", "    train_dataset = dataset,\n", "    dataset_text_field = \"text\",\n", "    max_seq_length = max_seq_length,\n", "    packing = False, # Can make training 5x faster for short sequences.\n", "    args = SFTConfig(\n", "        per_device_train_batch_size = 2,\n", "        gradient_accumulation_steps = 4,\n", "        warmup_steps = 5,\n", "        num_train_epochs = 5, # Number of full passes over the dataset.\n", "        #max_steps = 60, # Uncomment (and remove num_train_epochs) for a quick run.\n", "        learning_rate = 2e-4,\n", "        fp16 = False, # We're doing full float32, so disable mixed precision\n", "        bf16 = False, # We're doing full float32, so disable mixed precision\n", "        logging_steps = 1,\n", "        optim = \"adamw_8bit\",\n", "        weight_decay = 0.01,\n", "        lr_scheduler_type = \"linear\",\n", "        seed = 3407,\n", "        output_dir = \"outputs\",\n", "        report_to = \"tensorboard\", # Use this for WandB etc\n", "    ),\n", ")" ] },
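{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check on the schedule: the effective batch size is `per_device_train_batch_size` Γ— `gradient_accumulation_steps` = 2 Γ— 4 = 8 examples per optimizer step. With the 401-row dataset mapped above, that is about 401 / 8 β‰ˆ 50 steps per epoch, so 5 epochs come to roughly 250 steps." ] },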
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "2ejIt2xSNKKp" }, "outputs": [], "source": [ "# @title Show current memory stats\n", "gpu_stats = torch.cuda.get_device_properties(0)\n", "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n", "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n", "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n", "print(f\"{start_gpu_memory} GB of memory reserved.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2025-03-22T00:34:12.049152Z", "iopub.status.busy": "2025-03-22T00:34:12.048862Z", "iopub.status.idle": "2025-03-22T00:34:14.404349Z", "shell.execute_reply": "2025-03-22T00:34:14.403239Z", "shell.execute_reply.started": "2025-03-22T00:34:12.049130Z" }, "id": "yqxqAZ7KJ4oL" }, "outputs": [], "source": [ "trainer_stats = trainer.train()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "pCqnaKmlO1U9" }, "outputs": [], "source": [ "# @title Show final memory and time stats\n", "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n", "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n", "used_percentage = round(used_memory / max_memory * 100, 3)\n", "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n", "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n", "print(\n", "    f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\"\n", ")\n", "print(f\"Peak reserved memory = {used_memory} GB.\")\n", "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n", "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n", "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")" ] },
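{ "cell_type": "markdown", "metadata": {}, "source": [ "Because we set `report_to = \"tensorboard\"` in the `SFTConfig`, the loss curve is logged under `outputs/`. A small optional sketch for viewing it in Colab with the TensorBoard magics:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: inspect the logged training loss curves\n", "%load_ext tensorboard\n", "%tensorboard --logdir outputs" ] },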
{ "cell_type": "markdown", "metadata": { "id": "ekOmTR1hSNcr" }, "source": [ "<a name=\"Inference\"></a>\n", "### Inference\n", "Let's run the model! You can change the prompts.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "apUdB40Ep6Ki" }, "outputs": [], "source": [ "input_text = \"Hey there my name is Elise, and I'm a speech generation model that can sound like a person.\"\n", "\n", "chosen_voice = None # None for single-speaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2025-03-22T00:52:35.040842Z", "iopub.status.busy": "2025-03-22T00:52:35.040125Z", "iopub.status.idle": "2025-03-22T00:52:35.050560Z", "shell.execute_reply": "2025-03-22T00:52:35.049663Z", "shell.execute_reply.started": "2025-03-22T00:52:35.040818Z" }, "id": "krYI8PrRJ6MX" }, "outputs": [], "source": [ "#@title Run Inference\n", "\n", "import torch\n", "import re\n", "import numpy as np\n", "\n", "FastModel.for_inference(model) # Enable native 2x faster inference\n", "\n", "@torch.inference_mode()\n", "def generate_speech_from_text(\n", "    text: str,\n", "    temperature: float = 0.8, # Generation temperature\n", "    top_k: int = 50, # Generation top_k\n", "    top_p: float = 1.0, # Generation top_p\n", "    max_new_audio_tokens: int = 2048, # Max tokens for audio part\n", "    device: torch.device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", ") -> np.ndarray:\n", "    \"\"\"\n", "    Generates speech audio from text using default voice control parameters.\n", "\n", "    Args:\n", "        text (str): The text input to be converted to speech.\n", "        temperature (float): Sampling temperature for generation.\n", "        top_k (int): Top-k sampling parameter.\n", "        top_p (float): Top-p (nucleus) sampling parameter.\n", "        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).\n", "        device (torch.device): Device to run inference on.\n", "\n", "    Returns:\n", "        np.ndarray: Generated waveform as a NumPy array.\n", "    \"\"\"\n", "    torch.compiler.reset()\n", "\n", "    prompt = \"\".join([\n", "        \"<|task_tts|>\",\n", "        \"<|start_content|>\",\n", "        text,\n", "        \"<|end_content|>\",\n", "        \"<|start_global_token|>\"\n", "    ])\n", "\n", "    model_inputs = tokenizer([prompt], return_tensors=\"pt\").to(device)\n", "\n", "    print(\"Generating token sequence...\")\n", "    generated_ids = model.generate(\n", "        **model_inputs,\n", "        max_new_tokens=max_new_audio_tokens, # Limit generation length\n", "        do_sample=True,\n", "        temperature=temperature,\n", "        top_k=top_k,\n", "        top_p=top_p,\n", "        eos_token_id=tokenizer.eos_token_id, # Stop token\n", "        pad_token_id=tokenizer.pad_token_id # Use the model's pad token id\n", "    )\n", "    print(\"Token sequence generated.\")\n", "\n", "    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]\n", "\n", "    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]\n", "    # print(f\"\\nGenerated Text (for parsing):\\n{predicts_text}\\n\") # Debugging\n", "\n", "    # Extract semantic token IDs using regex\n", "    semantic_matches = re.findall(r\"<\\|bicodec_semantic_(\\d+)\\|>\", predicts_text)\n", "    if not semantic_matches:\n", "        print(\"Warning: No semantic tokens found in the generated output.\")\n", "        # Handle appropriately - perhaps return silence or raise error\n", "        return np.array([], dtype=np.float32)\n", "\n", "    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim\n", "\n", "    # Extract global token IDs using regex (assuming controllable mode also generates these)\n", "    global_matches = re.findall(r\"<\\|bicodec_global_(\\d+)\\|>\", predicts_text)\n", "    if not global_matches:\n", "        print(\"Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.\")\n", "        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)\n", "    else:\n", "        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim\n", "\n", "    pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)\n", "\n", "    print(f\"Found {pred_semantic_ids.shape[1]} semantic tokens.\")\n", "    print(f\"Found {pred_global_ids.shape[2]} global tokens.\")\n", "\n", "    # Detokenize using BiCodecTokenizer\n", "    print(\"Detokenizing audio tokens...\")\n", "    # Ensure audio_tokenizer and its internal model are on the correct device\n", "    audio_tokenizer.device = device\n", "    audio_tokenizer.model.to(device)\n", "    # Squeeze the extra dimension from global tokens as seen in SparkTTS example\n", "    wav_np = audio_tokenizer.detokenize(\n", "        pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)\n", "        pred_semantic_ids.to(device) # Shape (1, N_semantic)\n", "    )\n", "    print(\"Detokenization complete.\")\n", "\n", "    return wav_np\n", "\n", "if __name__ == \"__main__\":\n", "    text = f\"{chosen_voice}: \" + input_text if chosen_voice else input_text\n", "    print(f\"Generating speech for: '{text}'\")\n", "    generated_waveform = generate_speech_from_text(text)\n", "\n", "    if generated_waveform.size > 0:\n", "        import soundfile as sf\n", "        output_filename = \"generated_speech_controllable.wav\"\n", "        sample_rate = audio_tokenizer.config.get(\"sample_rate\", 16000)\n", "        sf.write(output_filename, generated_waveform, sample_rate)\n", "        print(f\"Audio saved to {output_filename}\")\n", "\n", "        # Optional: Play in notebook\n", "        from IPython.display import Audio, display\n", "        display(Audio(generated_waveform, rate=sample_rate))\n", "    else:\n", "        print(\"Audio generation failed (no tokens found?).\")" ] },
{ "cell_type": "markdown", "metadata": { "id": "uMuVrWbjAzhc" }, "source": [ "<a name=\"Save\"></a>\n", "### Saving, loading finetuned models\n", "To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.\n", "\n", "**[NOTE]** This ONLY saves the LoRA adapters, and not the full model (with `full_finetuning = True` as above, the full model is saved instead). To save to 16bit or GGUF, scroll down!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "upcOlWe7A1vc" }, "outputs": [], "source": [ "model.save_pretrained(\"lora_model\") # Local saving\n", "tokenizer.save_pretrained(\"lora_model\")\n", "# model.push_to_hub(\"your_name/lora_model\", token = \"...\") # Online saving\n", "# tokenizer.push_to_hub(\"your_name/lora_model\", token = \"...\") # Online saving" ] },
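{ "cell_type": "markdown", "metadata": {}, "source": [ "To load the finetuned model back for inference later, here is a minimal sketch, assuming the `lora_model` folder saved above and the same settings used when loading the base model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: reload the finetuned model saved above\n", "if False:\n", "    from unsloth import FastModel\n", "    model, tokenizer = FastModel.from_pretrained(\n", "        model_name = \"lora_model\", # the folder saved above\n", "        max_seq_length = max_seq_length,\n", "        dtype = torch.float32,\n", "        load_in_4bit = False,\n", "    )\n", "    FastModel.for_inference(model)" ] },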
{ "cell_type": "markdown", "metadata": { "id": "f422JgM9sdVT" }, "source": [ "\n", "### Saving to float16\n", "\n", "We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iHjt_SMYsd3P", "outputId": "bd8cccb7-6b95-45bf-80da-de120988447e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.\n", "We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.\n", "To force `safe_serialization`, set it to `None` instead.\n", "Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded\n", "model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.\n", "Unsloth: Will remove a cached repo with size 15.1G\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Unsloth: Merging 4bit and LoRA weights to 16bit...\n", "Unsloth: Will use up to 3.99 out of 12.67 RAM for saving.\n", "Unsloth: Saving model... This might take 5 minutes ...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 28/28 [00:01<00:00, 27.83it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Unsloth: Saving tokenizer... Done.\n", "Unsloth: Saving model/pytorch_model-00001-of-00002.bin...\n", "Unsloth: Saving model/pytorch_model-00002-of-00002.bin...\n", "Done.\n" ] } ], "source": [ "# Merge to 16bit\n", "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_16bit\",)\n", "if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"merged_16bit\", token = \"\")\n", "\n", "# Merge to 4bit\n", "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_4bit\",)\n", "if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"merged_4bit\", token = \"\")\n", "\n", "# Just LoRA adapters\n", "if False:\n", "    model.save_pretrained(\"model\")\n", "    tokenizer.save_pretrained(\"model\")\n", "if False:\n", "    model.push_to_hub(\"hf/model\", token = \"\")\n", "    tokenizer.push_to_hub(\"hf/model\", token = \"\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "egOSE7Cgynx7" }, "source": [ "And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, or need help joining projects, feel free to join our Discord!\n", "\n", "Some other links:\n", "1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)\n", "2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)\n", "3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)\n", "4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!\n", "\n", "
Join [Discord](https://discord.gg/unsloth) if you need help + ⭐️ Star us on [Github](https://github.com/unslothai/unsloth) ⭐️
\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kaggle": { "accelerator": "nvidiaTeslaT4", "dataSources": [], "dockerImageVersionId": 30919, "isGpuEnabled": true, "isInternetEnabled": true, "language": "python", "sourceType": "notebook" }, "kernelspec": { "display_name": "TTS_ft", "language": "python", "name": "tts_ft" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "0474debc340943bd85f3daf92aebf7aa": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_0de4d0f282404edfbc191dca73f15f35", "max": 401, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_e58b5ad2f781475d8af2ddb38009baa6", "value": 354 } }, "0de4d0f282404edfbc191dca73f15f35": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "2315228ff2b141afabe1263471f5364b": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_426eb100a94642f79e6b99777406a265", "placeholder": "​", "style": "IPY_MODEL_a36b5cf197dd4bd9a7f70aa6671b804c", "value": "Map:  88%" } }, "33fbacbb2aa146cd90586357eec1dc3e": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, 
"bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "426eb100a94642f79e6b99777406a265": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "930b4d1d5f4b494b830df4d4c398e67c": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "a36b5cf197dd4bd9a7f70aa6671b804c": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "a3b0c0581f1f4c428baaadd8e9a39b6f": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_2315228ff2b141afabe1263471f5364b", "IPY_MODEL_0474debc340943bd85f3daf92aebf7aa", "IPY_MODEL_cff1b0fa2ea24f45aab26685353eefdd" ], "layout": "IPY_MODEL_b7e20be79df246f19b35114a690e44f0" } }, "b7e20be79df246f19b35114a690e44f0": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": 
"1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "cff1b0fa2ea24f45aab26685353eefdd": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_33fbacbb2aa146cd90586357eec1dc3e", "placeholder": "​", "style": "IPY_MODEL_930b4d1d5f4b494b830df4d4c398e67c", "value": " 354/401 [03:01<00:22,  2.11 examples/s]" } }, "e58b5ad2f781475d8af2ddb38009baa6": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } } } } }, "nbformat": 4, "nbformat_minor": 4 }