Hey! Thanks for the examples. There was an issue with the dataset: I had chosen the small one for training. I tried the bigger one, and it's working now.
Thanks a lot for the help! :)
Hey, Steven! Thanks for the response!
I checked that everything matches the docs from your link, and it's still generating 1 second of audio. Also, only 51 tokens are generated, which, at XCodec2's nominal 50 speech tokens per second, lines up with that ~1 second.
Could this be related to a training issue? Maybe I'm missing something?
I've also noticed that after running the create_dataset script, only two files appear in the output folder: train_input_ids_shape.npy and train_input_ids.memmap. Is that OK?
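In case it helps, this is roughly how I'm sanity-checking those two files (a minimal sketch; I'm assuming the shape array was written with numpy.save and the tokens as an int32 numpy.memmap, so the dtype may need adjusting to match the script):

```python
import numpy as np

# Load the saved shape, e.g. (num_sequences, seq_len).
shape = tuple(np.load("output/train_input_ids_shape.npy"))
print("Dataset shape:", shape)

# Map the token file read-only with that shape; int32 is an assumption.
ids = np.memmap("output/train_input_ids.memmap", dtype=np.int32,
                mode="r", shape=shape)

# Peek at the first row to confirm it holds plausible token IDs.
print("First 32 tokens of row 0:", ids[0, :32])
```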
Here is an image of the model run with the logs.
Also, here is my test-model.py, in case it helps:
```python
import torch
import soundfile as sf
import re
from transformers import AutoTokenizer
from liger_kernel.transformers import AutoLigerKernelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load your fine-tuned model from the "output" directory.
model_path = "output"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoLigerKernelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)
model.eval()
model.to("cuda")

# Load the vocoder (XCodec2) model.
codec_model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(codec_model_path)
Codec_model.eval().to("cuda")
# Input text to convert to speech (Uzbek: "Hello! Welcome to the Shirin AI company! How can I help you?").
input_text = "Salom! Shirin AI companiyaga hush kelibsiz! Qanday yordam beriolayman?"
def extract_speech_ids(speech_tokens_str_list):
    """
    Use regex to extract numerical speech IDs from strings like "<|s_62770|>".
    This version handles a list of strings output by tokenizer.batch_decode.
    """
    speech_ids = []
    pattern = r"<\|s_(\d+)\|>"
    for token_str in speech_tokens_str_list:
        matches = re.findall(pattern, token_str)
        if matches:
            speech_ids.extend([int(num) for num in matches])
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids
with torch.no_grad():
    # Format the text with the special tokens used during training.
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Construct the chat prompt.
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    # Tokenize the prompt using the chat template.
    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors="pt",
        continue_final_message=True
    ).to("cuda")

    # Set the end-of-speech token.
    speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

    # Generate the speech tokens autoregressively.
    outputs = model.generate(
        input_ids,
        max_length=2048,  # adjust as needed
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )

    # Extract the generated tokens corresponding to speech (exclude the prompt and the EOS token).
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    print("Generated token IDs:", generated_ids)

    # IMPORTANT: Decode without skipping special tokens so that markers like <|s_62770|> are preserved.
    speech_tokens_str = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
    print("Decoded speech tokens:", speech_tokens_str)

    # Convert the string tokens into integer speech IDs using regex extraction.
    speech_ids = extract_speech_ids(speech_tokens_str)
    print("Extracted speech IDs:", speech_ids)

    # Convert the list of speech IDs to a tensor with shape [1, 1, L].
    speech_ids_tensor = torch.tensor(speech_ids, device="cuda").unsqueeze(0).unsqueeze(0)
    print("Speech IDs tensor shape:", speech_ids_tensor.shape)

    # Use the vocoder to decode the speech tokens into a waveform.
    gen_wav = Codec_model.decode_code(speech_ids_tensor)

    # Save the generated waveform to a 16 kHz WAV file.
    sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
    print("Saved generated speech to gen.wav")
```
Hi! Thanks for your post!
Right now I'm trying to train the model for the Uzbek language, and I'm new to LLMs and the ML sphere. Currently, I'm facing an issue with running the model.
I've run both scripts with the corresponding dataset, and judging by the output folder size, the training finished properly.
So, the question is: how can I run this model now?
Any suggestions or links that point me toward what to search for would be appreciated.
Currently, when running the model I'm getting a result like this:
"Decoded segment with markers: <|SPEECH_GENERATION_START|><|s_62770|><|s_63794|><|s_60710|><|s_43305|><|s_59942|><|s_15051|><|s_64054|><|s_62770|><|s_65078|><|s_61235|><|s_59702|><|s_55594|><|s_64822|><|s_59702|><|SPEECH_GENERATION_END|>
"
Thanks in advance!