|
--- |
|
title: README |
|
emoji: 🐨 |
|
colorFrom: yellow |
|
colorTo: green |
|
sdk: static |
|
pinned: false |
|
--- |
|
|
|
Shared Task Specification: “Small Models, Big Impact” |
|
Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters) |
|
|
|
1. Task Overview & Objectives |
|
Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continually pre-training or fine-tuning open-source models with ≤ 8 billion parameters.
|
Impact: Empower local NLP research and applications—chatbots, translation, sentiment analysis, educational tools—while lowering computational and storage barriers. |
|
Who Should Participate: |
|
Students & Academic Teams: Showcase research on model adaptation, data augmentation, and multilingual/multitask training.

Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
|
|
|
2. Allowed Base Models |
|
Participants may use any fully open-source LLM with ≤ 8 B parameters.
|
Note: Proprietary or closed-license models (e.g., GPT-3 series, Claude) are not allowed. |
|
|
|
3. Data Resources and Evaluation |
|
Training Data (public; a loading and training sketch follows this list):

Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.

Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
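
As a concrete starting point, here is a minimal data-loading and continual pre-training sketch using the Hugging Face `datasets` and `transformers` libraries. The dataset id (`wikimedia/wikipedia`, config `20231101.si`) and the base-model name are illustrative assumptions; substitute your chosen corpus and model.

```python
# Minimal continual pre-training sketch (illustrative, not an official pipeline).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "your-chosen-open-model"  # placeholder: any open-source LLM <= 8 B params

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:        # many causal LMs ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Sinhala Wikipedia as an example corpus; swap in OSCAR / CC100 subsets as needed.
corpus = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM
)
trainer.train()
```

If you train adapters (e.g., LoRA) rather than full weights, remember that adapter parameters count toward the 8 B limit (see Section 5).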
|
Evaluation: |
|
Submitted models will be evaluated with both intrinsic and extrinsic measures, as follows:

Intrinsic evaluation: perplexity on a held-out corpus.

Extrinsic evaluation: accuracy on the provided MMLU dataset.

You can use the given MMLU dataset to compare results in zero-shot, few-shot, and fine-tuned settings; sketches of both measures follow below.
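
For the intrinsic measure, corpus-level perplexity is commonly computed as the exponential of the mean per-token negative log-likelihood. A minimal sketch, assuming your submitted model and a plain-text held-out file (both names are placeholders):

```python
# Perplexity sketch: exp(mean NLL) over a held-out corpus, chunked to the context window.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-submitted-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

text = open("heldout_sinhala.txt", encoding="utf-8").read()  # placeholder eval corpus
ids = tokenizer(text, return_tensors="pt").input_ids[0]

chunk, nlls, n_tokens = 1024, [], 0
with torch.no_grad():
    for i in range(0, ids.size(0) - 1, chunk):
        window = ids[i : i + chunk + 1].unsqueeze(0)
        out = model(window, labels=window)   # loss = mean NLL of next-token predictions
        n = window.size(1) - 1               # number of predicted tokens in this window
        nlls.append(out.loss.item() * n)
        n_tokens += n

print("perplexity:", math.exp(sum(nlls) / n_tokens))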
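
For the extrinsic measure, one standard zero-shot recipe scores each question by the next-token log-probability of the option letters. The dataset id and field names below follow the public cais/mmlu layout and are assumptions; adapt them to the files the organisers release.

```python
# Zero-shot MMLU-style scoring sketch: pick the option letter with the highest
# next-token log-probability. Dataset id/fields are assumptions, not the official files.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-submitted-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

data = load_dataset("cais/mmlu", "all", split="test")  # swap in the provided set
letters = ["A", "B", "C", "D"]
# Heuristic: use the last sub-token of " A", " B", ... as each option's token id.
letter_ids = [tokenizer(f" {l}", add_special_tokens=False).input_ids[-1] for l in letters]

correct = 0
for ex in data.select(range(100)):  # small slice for illustration
    prompt = ex["question"] + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(letters, ex["choices"])) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    pred = max(range(4), key=lambda k: logits[letter_ids[k]].item())
    correct += int(pred == ex["answer"])

print("accuracy:", correct / 100)
```

Running the same loop with in-context examples prepended to the prompt gives the few-shot setting; re-running it after task-specific training gives the fine-tuned setting.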
|
|
|
4. Submission Requirements |
|
Model: uploaded in Hugging Face format (an upload sketch follows this list).
|
Scripts and Notebooks: uploaded to a GitHub or Hugging Face repository.
|
Technical Report (2-5 pages): |
|
Training details: data sources, training mechanism, epochs, batch size, learning rates. |
|
Resource usage: GPU hours and hardware used.
|
Model evaluation. |
|
Analysis of strengths/limitations. |
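
For the model upload, the standard `push_to_hub` utilities in `transformers` are one way to publish in Hugging Face format; the checkpoint path and repository id below are placeholders:

```python
# Publishing a trained model in Hugging Face format (repo id is a placeholder).
# Requires prior authentication, e.g. via `huggingface-cli login`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/final-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/final-checkpoint")

model.push_to_hub("your-org/sinhala-tamil-llm")
tokenizer.push_to_hub("your-org/sinhala-tamil-llm")
```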
|
|
|
|
|
5. Rules & Fairness |
|
Parameter Limit: Strict upper bound of 8 B parameters (base model plus any adapter weights; a self-check is sketched after this list).
|
Data Usage: Only public, openly licensed data; no private data and no content scraped from behind a login.
|
Reproducibility: All code, data-prep scripts, and logs must be publicly accessible by the submission deadline. |
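
A simple self-check against the parameter limit, counting all weights attached to the loaded model (the repo id is a placeholder):

```python
# Self-check against the 8 B limit: count every trainable and frozen parameter,
# including any adapter weights attached to the model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/sinhala-tamil-llm")  # placeholder
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.2f} B parameters")
assert total <= 8e9, "exceeds the 8 B parameter limit"
```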
|
|
|
6. How to Register & Contact |
|
Registration Form: https://forms.gle/edzfpopVvKkkF6cH8 |
|
Contact: [email protected] |
|
Phone: 076 981 1289 |
|
|