Hey! Thanks for the examples. There was an issue with the dataset: I had chosen the small one for training. I tried the bigger one, and it's working now.
Thanks a lot for the help! :)
Hey, Steven! Thanks for the response!
I checked that everything matches the docs from your link, and it's still generating 1 second of audio. Also, only 51 tokens are generated, which, at XCodec2's nominal 50 speech tokens per second, lines up with that ~1 second.
Could this be related to a training issue? Maybe I'm missing something?
I've also noticed that after running the create_dataset script, only two files appear in the output folder: train_input_ids_shape.npy and train_input_ids.memmap. Is that OK?
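In case it helps, this is roughly how I'm sanity-checking those two files (a minimal sketch; I'm assuming the shape array was written with numpy.save and the tokens as an int32 numpy.memmap, so the dtype may need adjusting to match the script):

```python
import numpy as np

# Load the saved shape, e.g. (num_sequences, seq_len).
shape = tuple(np.load("output/train_input_ids_shape.npy"))
print("Dataset shape:", shape)

# Map the token file read-only with that shape; int32 is an assumption.
ids = np.memmap("output/train_input_ids.memmap", dtype=np.int32,
                mode="r", shape=shape)

# Peek at the first row to confirm it holds plausible token IDs.
print("First 32 tokens of row 0:", ids[0, :32])
```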
Here is an image of the model run with the logs.
Also, here is my test-model.py, in case it helps:
```python
import torch
import soundfile as sf
import re
from transformers import AutoTokenizer
from liger_kernel.transformers import AutoLigerKernelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load your fine-tuned model from the "output" directory.
model_path = "output"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoLigerKernelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)
model.eval()
model.to("cuda")

# Load the vocoder (XCodec2) model.
codec_model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(codec_model_path)
Codec_model.eval().to("cuda")
# Input text to convert to speech (Uzbek: "Hello! Welcome to the Shirin AI company! How can I help you?").
input_text = "Salom! Shirin AI companiyaga hush kelibsiz! Qanday yordam beriolayman?"
def extract_speech_ids(speech_tokens_str_list):
    """
    Use regex to extract numerical speech IDs from strings like "<|s_62770|>".
    This version handles a list of strings output by tokenizer.batch_decode.
    """
    speech_ids = []
    pattern = r"<\|s_(\d+)\|>"
    for token_str in speech_tokens_str_list:
        matches = re.findall(pattern, token_str)
        if matches:
            speech_ids.extend([int(num) for num in matches])
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids
with torch.no_grad():
    # Format the text with the special tokens used during training.
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Construct the chat prompt.
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    # Tokenize the prompt using the chat template.
    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors="pt",
        continue_final_message=True
    ).to("cuda")

    # Set the end-of-speech token.
    speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

    # Generate the speech tokens autoregressively.
    outputs = model.generate(
        input_ids,
        max_length=2048,  # adjust as needed
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )

    # Extract the generated tokens corresponding to speech (exclude the prompt and the EOS token).
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    print("Generated token IDs:", generated_ids)

    # IMPORTANT: Decode without skipping special tokens so that markers like <|s_62770|> are preserved.
    speech_tokens_str = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
    print("Decoded speech tokens:", speech_tokens_str)

    # Convert the string tokens into integer speech IDs using regex extraction.
    speech_ids = extract_speech_ids(speech_tokens_str)
    print("Extracted speech IDs:", speech_ids)

    # Convert the list of speech IDs to a tensor with shape [1, 1, L].
    speech_ids_tensor = torch.tensor(speech_ids, device="cuda").unsqueeze(0).unsqueeze(0)
    print("Speech IDs tensor shape:", speech_ids_tensor.shape)

    # Use the vocoder to decode the speech tokens into a waveform.
    gen_wav = Codec_model.decode_code(speech_ids_tensor)

    # Save the generated waveform to a 16 kHz WAV file.
    sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
    print("Saved generated speech to gen.wav")
```
Hi! Thanks for your post!
Right now I'm trying to train the model for the Uzbek language, and I'm new to LLMs and the ML sphere. Currently, I'm facing an issue with running the model.
I've run both scripts with the corresponding dataset, and judging by the output folder size, the training finished properly.
So, the question is: how can I run this model now?
Any suggestions or links that point me toward what to search for would be appreciated.
Currently, when running the model I'm getting a result like this:
"Decoded segment with markers: <|SPEECH_GENERATION_START|><|s_62770|><|s_63794|><|s_60710|><|s_43305|><|s_59942|><|s_15051|><|s_64054|><|s_62770|><|s_65078|><|s_61235|><|s_59702|><|s_55594|><|s_64822|><|s_59702|><|SPEECH_GENERATION_END|>
"
Thanks in advance!