language:
- en
- ta
license: cc-by-4.0
tags:
- translation
- tamil
- colloquial-tamil
- fine-tuned
- text-to-text
datasets:
- janisrebekahv/colloquial_tamil
- jarvisvasu/english-to-colloquial-tamil
- chatgpt-generated
- youtube-comments
model-index:
- name: janisrebekahv/finetuned-colloquial-tamil
  results:
  - task:
      type: translation
      name: English to Colloquial Tamil
    dataset:
      name: janisrebekahv/colloquial_tamil
      type: text
    metrics:
    - name: BLEU Score
      type: bleu
      value: 38.5
    - name: ROUGE Score
      type: rouge
      value: 0.72
janisrebekahv/finetuned-colloquial-tamil
Model Overview
This is a fine-tuned version of suriya7/English-to-Tamil, trained to produce colloquial Tamil translations instead of formal Tamil.
- Translates English → Colloquial Tamil
- Incorporates slang, informal speech, and real-world phrasing
- Useful for chatbots, conversational AI, and social media applications
Dataset
Custom Dataset Used for Fine-Tuning:
janisrebekahv/colloquial_tamil
This dataset was curated specifically to fine-tune this model for English-to-colloquial-Tamil translation. It combines three sources:
1. jarvisvasu/english-to-colloquial-tamil: a publicly available dataset of informal Tamil translations.
2. YouTube Comments Dataset (Custom-Created): comments extracted using the YouTube API and manually converted to colloquial Tamil for authenticity.
3. ChatGPT-Generated Data: additional colloquial Tamil phrases aligned with natural speech patterns.
Total dataset size: 16,269 sentence pairs
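To check the dataset before using it, a minimal sketch along these lines may help. It assumes the dataset is publicly loadable from the Hugging Face Hub with the `datasets` library under its default configuration. The split and column names are not documented here, so the sketch reports them rather than guessing; `load_pairs` and `describe_splits` are hypothetical helper names.

```python
def load_pairs(repo_id="janisrebekahv/colloquial_tamil"):
    """Download the dataset from the Hub (assumes it is public and uses the default config)."""
    # Imported lazily so the helper below can be used without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(repo_id)


def describe_splits(dataset_dict):
    """Map each split name to its row count and column names."""
    return {
        name: {"rows": len(split), "columns": list(split.column_names)}
        for name, split in dataset_dict.items()
    }
```

Calling `describe_splits(load_pairs())` should list each available split with its size and columns, which shows which fields hold the English and Tamil text.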
Example Usage
Load and test the model using Hugging Face Transformers:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model_name = "janisrebekahv/finetuned-colloquial-tamil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Function to translate text
def translate(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example translations
test_sentences = [
    "This is so beautiful",
    "Bro, are you coming or not?",
    "My mom is gonna kill me if I don't reach home now!"
]

for sentence in test_sentences:
    print(f"English: {sentence}")
    print(f"Colloquial Tamil: {translate(sentence)}\n")
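For more than a handful of sentences, translating in padded batches is usually faster than calling the model once per sentence. The following is a minimal sketch, not part of the original model card: `chunks` and `translate_batch` are hypothetical helpers, and the batch size of 8 is an arbitrary assumption. It expects the `tokenizer` and `model` objects loaded above to be passed in.

```python
from typing import Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(sentences, tokenizer, model, batch_size=8, max_length=128):
    """Translate sentences in padded batches; returns one string per input."""
    translations = []
    for batch in chunks(list(sentences), batch_size):
        # Pad to the longest sentence in the batch so the inputs form one tensor.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_length=max_length)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

With the objects from the example above, `translate_batch(test_sentences, tokenizer, model)` should return the translations in the same order as the inputs.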