---
language:
  - en
  - ta
license: cc-by-4.0
tags:
  - translation
  - tamil
  - colloquial-tamil
  - fine-tuned
  - text-to-text
datasets:
  - janisrebekahv/colloquial_tamil
  - jarvisvasu/english-to-colloquial-tamil
  - chatgpt-generated
  - youtube-comments
model-index:
  - name: janisrebekahv/finetuned-colloquial-tamil
    results:
      - task:
          type: translation
          name: English to Colloquial Tamil
        dataset:
          name: janisrebekahv/colloquial_tamil
          type: text
        metrics:
          - name: BLEU Score
            type: bleu
            value: 38.5
          - name: ROUGE Score
            type: rouge
            value: 0.72
---

# janisrebekahv/finetuned-colloquial-tamil

## 📌 Model Overview

This is a fine-tuned version of suriya7/English-to-Tamil, trained to produce colloquial Tamil translations instead of formal Tamil.

✅ Translates English → Colloquial Tamil
✅ Incorporates slang, informal speech, and real-world phrasing
✅ Useful for chatbots, conversational AI, and social media applications


## 📜 Dataset

🔹 Custom dataset used for fine-tuning:
📂 janisrebekahv/colloquial_tamil
This dataset was curated specifically for this model to improve its ability to translate English into colloquial Tamil accurately. It combines three sources:

1️⃣ jarvisvasu/english-to-colloquial-tamil – a publicly available dataset of informal Tamil translations.
2️⃣ YouTube Comments Dataset (custom-created) – comments extracted with the YouTube Data API and manually rewritten in colloquial Tamil for authenticity (a sketch of this extraction step follows the list).
3️⃣ ChatGPT-Generated Data – additional colloquial Tamil phrases aligned with natural speech patterns.
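
The extraction step for the YouTube-comment source can be reproduced roughly as follows. This is a minimal sketch, assuming the `google-api-python-client` package; the API key and video ID are hypothetical placeholders, and the real curation involved manually converting the fetched comments into English/colloquial-Tamil pairs:

```python
from googleapiclient.discovery import build

API_KEY = "YOUR_YOUTUBE_DATA_API_KEY"        # assumption: a valid YouTube Data API v3 key
VIDEO_ID = "VIDEO_ID_WITH_TAMIL_COMMENTS"    # hypothetical placeholder

youtube = build("youtube", "v3", developerKey=API_KEY)

# Fetch top-level comments for one video; the real pipeline would page through
# many videos and then manually rewrite the comments as translation pairs.
response = youtube.commentThreads().list(
    part="snippet",
    videoId=VIDEO_ID,
    maxResults=100,
    textFormat="plainText",
).execute()

comments = [
    item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
    for item in response.get("items", [])
]
print(f"Fetched {len(comments)} comments")
```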

πŸ“ Total dataset size: 16,269 sentence pairs


## 🔥 Example Usage

Load and test the model using Hugging Face Transformers:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model_name = "janisrebekahv/finetuned-colloquial-tamil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Function to translate text
def translate(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example translations
test_sentences = [
    "This is so beautiful",
    "Bro, are you coming or not?",
    "My mom is gonna kill me if I don't reach home now!"
]

for sentence in test_sentences:
    print(f"English: {sentence}")
    print(f"Colloquial Tamil: {translate(sentence)}\n")