---
library_name: transformers
language:
- yo
- ig
- ha
base_model:
- HuggingFaceTB/SmolLM2-360M
- saheedniyi/YarnGPT
pipeline_tag: text-to-speech
license: cc-by-nc-sa-4.0
---

# YarnGPT-local

![image/png](https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/logo.webp)

## Table of Contents

1. [Model Summary](#model-summary)
2. [Model Description](#model-description)
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
    - [Recommendations](#recommendations)
4. [Speech Samples](#speech-samples)
5. [Training](#training)
6. [Future Improvements](#future-improvements)
7. [Citation](#citation)
8. [Credits & References](#credits--references)

## Model Summary

YarnGPT-local is a text-to-speech (TTS) model that synthesizes Yoruba, Igbo, and Hausa speech using pure language modelling, without external adapters or complex architectures. It aims to offer high-quality, natural, and culturally relevant speech synthesis for diverse applications.

#### How to use (on Google Colab)

The model can generate audio on its own, but it is better to prompt it with a voice. The following 10 voices are supported by default:

- hausa_female1
- hausa_female2
- hausa_male1
- hausa_male2
- igbo_female1
- igbo_female2
- igbo_male2
- yoruba_female1
- yoruba_female2
- yoruba_male2

### Prompt YarnGPT-local

```python
# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git

# install the required libraries
!pip install outetts==0.2.3 uroman

# import the required packages
import os
import re
import json
import torch
import inflect
import random
import uroman as ur
import numpy as np
import torchaudio
import IPython
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizerForLocal

# download the WavTokenizer config and weights (used to encode and decode the audio)
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!wget https://huggingface.co/novateur/WavTokenizer-large-speech-75token/resolve/main/wavtokenizer_large_speech_320_24k.ckpt

# model path and WavTokenizer paths (these assume Google Colab;
# a different environment might save the files to a different location)
hf_path = "saheedniyi/YarnGPT-local"
wav_tokenizer_config_path = "/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

# create the AudioTokenizer object
audio_tokenizer = AudioTokenizerForLocal(
    hf_path, wav_tokenizer_model_path, wav_tokenizer_config_path
)

# load the model weights
model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype="auto").to(audio_tokenizer.device)

# your input text
text = "Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un."

# create a prompt; the third argument (`speaker_name`) is optional
prompt = audio_tokenizer.create_prompt(text, "yoruba", "yoruba_male2")

# tokenize the prompt
input_ids = audio_tokenizer.tokenize_prompt(prompt)

# generate output from the model; tune the `.generate` parameters as you wish
output = model.generate(
    input_ids=input_ids,
    temperature=0.1,
    repetition_penalty=1.1,
    num_beams=4,
    max_length=4000,
)

# convert the output to "audio codes"
codes = audio_tokenizer.get_codes(output)

# convert the codes to audio
audio = audio_tokenizer.get_audio(codes)

# play the audio
IPython.display.Audio(audio, rate=24000)

# save the audio
torchaudio.save("audio.wav", audio, sample_rate=24000)
```

### Simple News-Reader for Local languages

```python
# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git

# install the required libraries
!pip install outetts uroman trafilatura pydub

# import the required packages
import os
import re
import json
import torch
import inflect
import random
import requests
import trafilatura
import uroman as ur
import numpy as np
import torchaudio
import IPython
from pydub import AudioSegment
from pydub.effects import normalize
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizer, AudioTokenizerForLocal

# download the `WavTokenizer` files
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!wget https://huggingface.co/novateur/WavTokenizer-large-speech-75token/resolve/main/wavtokenizer_large_speech_320_24k.ckpt

tokenizer_path = "saheedniyi/YarnGPT-local"
wav_tokenizer_config_path = "/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

audio_tokenizer = AudioTokenizerForLocal(
    tokenizer_path, wav_tokenizer_model_path, wav_tokenizer_config_path
)

model = AutoModelForCausalLM.from_pretrained(tokenizer_path, torch_dtype="auto").to(audio_tokenizer.device)

# Split text into chunks of at most `word_limit` words, sentence by sentence
def split_text_into_chunks(text, word_limit=25):
    sentences = [sentence.strip() for sentence in text.split('.') if sentence.strip()]
    chunks = []
    for sentence in sentences:
        chunks.append(".")
        sentence_splitted = sentence.split(" ")
        num_words = len(sentence_splitted)
        start_index = 0
        if num_words > word_limit:
            # NOTE: the loop body below is reconstructed from the function's
            # intent (fixed-size word windows); the source text is cut off here.
            while start_index < num_words:
                end_index = min(num_words, start_index + word_limit)
                chunks.append(" ".join(sentence_splitted[start_index:end_index]))
                start_index = end_index
        else:
            chunks.append(sentence)
    return chunks

# The rest of the pipeline is not reproduced here; see the news-reader
# notebook linked under Model Description.
```

## Model Description

- **Model type:** Text-to-Speech
- **Finetuned from:** [HuggingFaceTB/SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M)
- **Repository:** [YarnGPT GitHub Repository](https://github.com/saheedniyi02/yarngpt)
- **Paper:** In progress.
- **Demo:** 1) [Prompt YarnGPT-local notebook](https://colab.research.google.com/drive/1UWeirECQbjFGib1SqpiDdkzS1Bi_vi9i?usp=sharing) 2) [Simple news reader: YarnGPT-local](https://colab.research.google.com/drive/1CMsLVsDaX2u4YUtV01fOvnDCtCC59bNe?usp=sharing)

#### Uses

Generate Yoruba, Igbo, and Hausa speech for experimental purposes.

#### Out-of-Scope Use

The model is not suitable for generating speech in languages other than Yoruba, Igbo, and Hausa.

## Bias, Risks, and Limitations

- The model may not capture the full diversity of Nigerian accents and could exhibit biases based on the training dataset.
- The generated audio is sometimes very fast and may need post-processing.
- The model does not take intonation into account, which sometimes leads to mispronunciation of some words.
- The model does not respond to some prompts.

#### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Feedback and diverse training data contributions are encouraged.

## Speech Samples

Listen to samples generated by YarnGPT-local:

| Input | Audio | Notes |
|---|---|---|
| Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un | (embedded audio) | temperature=0.1, repetition_penalty=1.1, num_beams=4; voice: yoruba_male2 |
| Iwadii fihan pe ọkan lara awọn eeyan meji yii lo ṣee si ja sinu tanki epo disu naa lasiko to n ṣiṣẹ lọwọ. | (embedded audio) | temperature=0.1, repetition_penalty=1.1, num_beams=4; voice: yoruba_female1 |
| Shirun da gwamnati mai ci yanzu ta yi wajen kin bayani a akan halin da ake ciki a game da batun kidayar shi ne ya janyo wannan zargi da jam'iyyar ta Labour ta yi. | (embedded audio) | temperature=0.1, repetition_penalty=1.1, num_beams=4; voice: hausa_male2 |
| A lokuta da dama yakan fito a matsayin jarumin da ke taimaka wa babban jarumi, kodayake a wasu fina-finan yakan fito a matsayin babban jarumi. | (embedded audio) | temperature=0.1, repetition_penalty=1.1, num_beams=4; voice: hausa_female1 |
| Amụma ndị ọzọ o buru gụnyere inweta ihe zuru oke, ịmụta ụmụaka nye ndị na-achọ nwa | (embedded audio) | temperature=0.1, repetition_penalty=1.1, num_beams=4; voice: igbo_female1 |
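
The limitations listed earlier note that generated audio is sometimes very fast. The following is a minimal sketch of one possible post-processing step, not something taken from the YarnGPT repository: it slows a WAV clip by lowering the sample rate declared in its header, which also lowers the pitch. The helper names `slow_down_wav` and `wav_duration_seconds` are hypothetical, and the sketch assumes integer PCM WAV data, since Python's `wave` module does not read floating-point WAV files.

```python
import io
import wave


def slow_down_wav(wav_bytes: bytes, speed: float = 0.9) -> bytes:
    """Return WAV bytes whose declared sample rate is scaled by `speed`.

    Hypothetical helper: speed < 1.0 slows playback (and lowers pitch).
    The audio samples themselves are copied unchanged.
    Assumes integer PCM input, which is what the `wave` module supports.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")

    # Read the original parameters and raw frames.
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    # Write the same frames back with a scaled sample rate in the header.
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(round(params.framerate * speed))
        dst.writeframes(frames)
    return out.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration implied by the frame count and the declared sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

With `speed=0.9`, a 24 kHz clip is declared as 21.6 kHz, so it plays for about 11% longer. A pitch-preserving time-stretch would need a dedicated audio library instead.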

## Training

#### Data

Trained on open-source data in Yoruba, Igbo, and Hausa.

#### Preprocessing

Audio files were preprocessed, resampled to 24 kHz, and tokenized using [WavTokenizer](https://huggingface.co/novateur/WavTokenizer).

#### Training Hyperparameters

- **Number of epochs:** 5
- **Batch size:** 4
- **Scheduler:** linear schedule with warmup for 4 epochs, then linear decay to zero for the last epoch
- **Optimizer:** AdamW (betas=(0.9, 0.95), weight_decay=0.01)
- **Learning rate:** 1e-3

#### Hardware

- **GPUs:** 1 × A100 (Google Colab, 30 hours)

#### Software

- **Training Framework:** PyTorch

## Future Improvements

- Scaling up model size and training data
- Wrapping the model in an API endpoint
- Voice cloning
- Potential expansion into speech-to-speech assistant models

## Citation

#### BibTeX:

```bibtex
@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT-local: Nigerian languages Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/saheedniyi/YarnGPT-local}
}
```

#### APA:

```text
Saheed Azeez. (2025). YarnGPT-local: Nigerian languages Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT-local
```

## Credits & References

- [OuteAI/OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M/)
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- [Voicera](https://huggingface.co/Lwasinam/voicera)