---
title: README
emoji: 🐨
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---
# Shared Task Specification: “Small Models, Big Impact”

## Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)
## 1. Task Overview & Objectives

- **Goal:** Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continual pre-training or fine-tuning of open-source models with ≤ 8 billion parameters.
- **Impact:** Empower local NLP research and applications such as chatbots, translation, sentiment analysis, and educational tools, while lowering computational and storage barriers.
- **Who Should Participate:**
  - **Students & academic teams:** Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
  - **Industry & startups:** Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
## 2. Allowed Base Models

Participants must choose one of the following base models, or any other fully open-source LLM with ≤ 8 B parameters.

**Note:** Proprietary or closed-license models (e.g., the GPT-3 series, Claude) are not allowed.
## 3. Data Resources and Evaluation

**Training data (public):**

- Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
- Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
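Raw web corpora such as OSCAR and Common Crawl mix scripts and contain many duplicates, so some filtering is usually needed before pre-training. The sketch below keeps deduplicated lines that are mostly in the target script; the Unicode block ranges are from the Unicode standard, while the 0.5 ratio threshold and exact-line deduplication are illustrative assumptions, not task requirements.

```python
# Minimal corpus-filtering sketch for a Sinhala/Tamil pre-training set.
# Script ranges are the official Unicode blocks; the min_ratio threshold
# and exact-match dedup are assumptions chosen for illustration.

SINHALA = (0x0D80, 0x0DFF)  # Unicode block for the Sinhala script
TAMIL = (0x0B80, 0x0BFF)    # Unicode block for the Tamil script

def script_ratio(text: str, block: tuple) -> float:
    """Fraction of non-space characters inside the given Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if block[0] <= ord(c) <= block[1])
    return in_block / len(chars)

def clean_corpus(lines, block, min_ratio=0.5):
    """Keep deduplicated, non-empty lines that are mostly in the target script."""
    seen, kept = set(), []
    for line in lines:
        line = line.strip()
        if line and line not in seen and script_ratio(line, block) >= min_ratio:
            seen.add(line)
            kept.append(line)
    return kept
```

For example, a batch containing a Sinhala line, its duplicate, and an English-only line reduces to just the one Sinhala line.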
**Evaluation:**

Your LLM will be evaluated using both intrinsic and extrinsic measures:

- Intrinsic evaluation using perplexity.
- Extrinsic evaluation using the appropriate MMLU metric.

You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings.
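For the intrinsic metric, perplexity is the exponential of the average negative log-likelihood per token. In practice the per-token log-probabilities come from your model (e.g. from the logits of a causal LM); the minimal sketch below takes them as a plain list to show the computation itself.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_logprobs` holds the natural-log probability your model assigned
    to each token of the evaluation text (all values <= 0).
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

As a sanity check, a model that assigns every token probability 1/4 has perplexity exactly 4, and a model that predicts every token with probability 1 has perplexity 1.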
## 4. Submission Requirements

- **Model:** HuggingFace-format upload.
- **Scripts and notebooks:** Should be uploaded to a GitHub or HuggingFace repository.
- **Technical report (2–5 pages):**
  - Training details: data sources, training mechanism, epochs, batch size, learning rates.
  - Resource usage: GPU time and a list of hardware resources.
  - Model evaluation.
  - Analysis of strengths and limitations.
## 5. Rules & Fairness

- **Parameter limit:** Strict upper bound of 8 B parameters (model + adapter weights).
- **Data usage:** Only public, open-license data; no private data or content scraped from behind a login.
- **Reproducibility:** All code, data-preparation scripts, and logs must be publicly accessible by the submission deadline.
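To check a candidate architecture against the 8 B cap before training, a back-of-envelope estimate for a decoder-only transformer can help. The formula below (roughly 4·d² for attention plus 8·d² for the MLP per layer, plus the embedding table) is an illustrative assumption, not the official counting rule; the authoritative count is `sum(p.numel() for p in model.parameters())` on the actual checkpoint, with adapter weights included.

```python
# Rough sanity check against the 8 B parameter limit. The per-layer
# cost of ~12*d^2 (attention + MLP) is a common approximation for
# decoder-only transformers and is an assumption for illustration only.

LIMIT = 8_000_000_000  # model + adapter weights, per the task rules

def estimate_params(vocab_size: int, hidden_size: int, num_layers: int) -> int:
    """Approximate parameter count of a decoder-only transformer."""
    embeddings = vocab_size * hidden_size   # input embeddings (tied output head)
    per_layer = 12 * hidden_size ** 2       # ~4d^2 attention + ~8d^2 MLP
    return embeddings + num_layers * per_layer

def within_limit(vocab_size, hidden_size, num_layers, limit=LIMIT):
    return estimate_params(vocab_size, hidden_size, num_layers) <= limit
```

For example, a 7B-class configuration (vocabulary 32 000, hidden size 4096, 32 layers) estimates to about 6.6 B parameters, comfortably under the cap, while doubling the hidden size and adding layers blows well past it.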
## 6. How to Register & Contact

- **Registration form:** https://forms.gle/edzfpopVvKkkF6cH8
- **Contact:** [email protected]
- **Phone:** 076 981 1289