Eli: A Bilingual Hindi-English Large Language Model
Introduction
Eli is an innovative, open-source bilingual Hindi-English Large Language Model (LLM) designed to bridge the linguistic gap between Hindi and English. Developed with meticulous attention to detail, Eli represents a pioneering effort to broaden the scope of LLMs to diverse languages.
Purpose Behind Eli
Why We Built Eli:
- Language Adaptation: Enhance language adaptability within LLMs for Hindi and English.
- Efficient Training: Train and fine-tune on a compact dataset of 1 billion tokens.
- Optimized Processes: Identify and implement the most efficient training processes.
- World Knowledge Acquisition: Observe how the model acquires and processes world knowledge.
- Training Method Optimization: Optimize training methods tailored to each development stage.
Development Stages
Pre-training
- Objective: Familiarize Eli with a newly enriched vocabulary (see the vocabulary-extension sketch after this list).
- Method: Full-weight pre-training on a 500-million-token corpus using 2xA100 GPUs, taking about 25 hours.
- Outcome: Improved Hindi token prediction and generation capabilities.
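The exact pre-training scripts aren't published in this card, but the vocabulary-enrichment step can be illustrated with a minimal sketch. Assuming Hindi tokens learned separately (for example with SentencePiece) and a generic Llama 3 base, the new tokens are added to the tokenizer and the embedding matrix is resized before full-weight training. Every identifier below is a placeholder, not Eli's actual configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base checkpoint; Eli's actual base and Hindi token list
# are not published in this card.
base = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Hindi tokens learned from the Hindi corpus.
new_hindi_tokens = ["नमस्ते", "धन्यवाद", "भारत"]

# Add only tokens the tokenizer doesn't already have, then resize the
# embedding matrix so the new rows become trainable parameters.
added = tokenizer.add_tokens(
    [t for t in new_hindi_tokens if t not in tokenizer.get_vocab()]
)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {added} tokens; vocabulary size is now {len(tokenizer)}")
```

Full-weight pre-training on the 500-million-token corpus then proceeds with the standard causal language modeling objective, letting both the new embedding rows and the original weights update.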
Bilingual Next Token Prediction and Translation
- Inspired By: The OpenHathi series by Sarvam AI.
- Dataset: 200,000 tokens, translated with IndicTrans2.
- Method: Alternating sentences between Hindi and English for enhanced alignment and balanced exposure (a data-construction sketch follows this list).
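The data-preparation code for this stage isn't included here, but the alternating-sentence idea can be shown with a short sketch. `translate_to_hindi` below is a stand-in for an IndicTrans2 call, not a real API:

```python
def translate_to_hindi(sentence: str) -> str:
    """Placeholder for an IndicTrans2 English-to-Hindi translation call."""
    raise NotImplementedError

def make_bilingual_example(english_sentences: list[str]) -> str:
    """Interleave Hindi and English so one training sequence exposes the
    model to both languages and to aligned content across them."""
    mixed = []
    for i, sentence in enumerate(english_sentences):
        # Even-indexed sentences are translated to Hindi; odd-indexed
        # sentences stay in English.
        mixed.append(translate_to_hindi(sentence) if i % 2 == 0 else sentence)
    return " ".join(mixed)
```

Training on such sequences gives the model balanced exposure to both languages and an implicit translation signal, since adjacent sentences carry related content in different languages.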
Bilingual Instruct Fine-tuning
- Objective: Enhance model responsiveness in both English and Hindi.
- Method: Supervised fine-tuning with low-rank adaptation (LoRA) using various instruction datasets (see the sketch after this list).
- Outcome: A fine-tuned model available on Hugging Face, with a 4-bit quantized version for hands-on experimentation.
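The fine-tuning configuration isn't reproduced in this card; the sketch below shows what a minimal low-rank adaptation setup with the `peft` library might look like. The rank, alpha, and target modules are illustrative assumptions, not Eli's published values:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli", torch_dtype=torch.bfloat16
)

# Illustrative LoRA hyperparameters (not the values used for Eli).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter matrices train
```

With the base weights frozen, supervised fine-tuning over the instruction datasets updates only the small adapter matrices, which keeps memory and compute costs low.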
DPO Fine-tuning
- Objective: Refine model preferences using Direct Preference Optimization.
- Method: Translating the Anthropic/hh-rlhf preference dataset and fine-tuning on it (see the sketch after this list).
- Outcome: Ongoing comprehensive evaluation.
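As a rough sketch, a DPO run with the `trl` library has the shape below. Exact argument names vary across `trl` versions, the hyperparameters are placeholders, and the hh-rlhf records would first need translating and reshaping into prompt/chosen/rejected fields (omitted here):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Neohumans-ai/Eli")
tokenizer = AutoTokenizer.from_pretrained("Neohumans-ai/Eli")

# Preference pairs; Eli's pipeline translates these to Hindi first, and
# the records must be mapped into prompt/chosen/rejected columns.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

args = DPOConfig(
    output_dir="eli-dpo",
    beta=0.1,  # placeholder: strength of the preference penalty
    per_device_train_batch_size=2,
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

DPO nudges the model toward preferred responses directly from the paired data, without training a separate reward model.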
Learnings and Future Directions
Challenges:
- World Knowledge: Occasional hallucinations in response to specific queries.
- Translation: Requires more training data for nuanced translations.
- Fine-tuning: Future iterations will weigh full-weight fine-tuning against LoRA fine-tuning based on further tests.
What's Next:
- Romanized Hindi: Incorporate Romanized Hindi for added linguistic versatility.
- Continuous Learning: Refine data pipelines, increase the training dataset to 10-15 billion Hindi tokens, and improve efficiency.
Generate
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16 and move it to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli", torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Neohumans-ai/Eli", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are Eli, an AI assistant created by NeoHumans-ai and trained on top of the Llama 3 Large Language Model (LLM), proficient in English and Hindi. You can respond in both languages based on the user's request."},
    {"role": "user", "content": "Who are you"},
]

# Render the chat template and tokenize it in one step.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
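The snippet above loads the full bfloat16 weights. If GPU memory is tight, the 4-bit variant mentioned earlier can be approximated by quantizing at load time with bitsandbytes; this is a generic transformers recipe, not an Eli-specific one:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli",
    quantization_config=quant_config,
    device_map="auto",  # place layers automatically across available devices
)
```

Generation then works exactly as in the example above.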
Multi-turn Chat
To chat with Eli over multiple turns, you can follow the example code below:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli", torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Neohumans-ai/Eli", trust_remote_code=True)

# Conversation history, seeded with the system prompt.
messages = [
    {"role": "system", "content": "You are Eli, an AI assistant created by NeoHumans-ai and trained on top of the Llama 3 Large Language Model (LLM), proficient in English and Hindi. You can respond in both languages based on the user's request."},
]

def process_user_input(user_input):
    """Add the user's message, stream a reply, and store it in the history."""
    global messages

    messages.append({"role": "user", "content": user_input})

    # Render the full conversation as a prompt string.
    prompt_formatted_message = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    generation_config = GenerationConfig(
        repetition_penalty=1.2,
        max_new_tokens=8000,
        temperature=0.2,
        top_p=0.95,
        top_k=40,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,
        use_cache=True,
        return_dict_in_generate=True,
    )

    # Stream tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer)
    batch = tokenizer(prompt_formatted_message.strip(), return_tensors="pt")

    print("\033[32mResponse: \033[0m")
    generated = model.generate(
        inputs=batch["input_ids"].to("cuda"),
        generation_config=generation_config,
        streamer=streamer,
    )

    # Extract the assistant's reply: the text between the last assistant
    # header and the final <|eot_id|> token.
    decoded = tokenizer.decode(generated.sequences.cpu().tolist()[0])
    header = "<|start_header_id|>assistant<|end_header_id|>"
    start = decoded.rfind(header)
    end = decoded.rfind("<|eot_id|>")
    if start == -1 or end == -1:
        raise ValueError("Failed to parse the assistant response from the output")
    final_response = decoded[start + len(header):end].strip()

    # Append the reply to the history so the next turn can see it.
    messages.append({"role": "assistant", "content": final_response})

# Main interaction loop; an empty input ends the chat.
while True:
    print("=" * 80)
    user_input = input("Input: ")
    if not user_input.strip():
        break
    process_user_input(user_input)
```
Prompt format
System prompt: You are Eli, an AI assistant created by NeoHumans-ai and trained on top of the Llama 3 Large Language Model (LLM), proficient in English and Hindi. You can respond in both languages based on the user's request.
The model follows the Llama 3 chat template:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ user_message_2 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
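You normally don't need to assemble this string by hand: `tokenizer.apply_chat_template` with `tokenize=False` renders the same layout from a messages list. A quick check (the message contents here are just examples):

```python
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are Eli..."},
        {"role": "user", "content": "नमस्ते!"},
    ],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # prints the <|begin_of_text|>/<|start_header_id|> layout above
```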
Benchmarks
Coming soon.
Conclusion
Eli is designed to handle multi-turn chat conversations and understands Hinglish, making it highly effective for bilingual and code-mixed language contexts. Explore Eli’s capabilities on Hugging Face and experience the model firsthand on chat.cognitivelab.in.
Weights and datasets are available on Hugging Face.
Stay tuned for more updates as we continue to evolve and enrich Eli.