sdk: static
pinned: false
---
# Shared Task Specification: “Small Models, Big Impact”

**Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)**
## 1. Task Overview & Objectives

**Goal:** Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continually pre-training or fine-tuning open-source models with ≤ 8 billion parameters.

**Impact:** Empower local NLP research and applications, such as chatbots, translation, sentiment analysis, and educational tools, while lowering computational and storage barriers.

**Who Should Participate:**

- **Students & Academic Teams:** Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
- **Industry & Startups:** Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
## 2. Allowed Base Models

Participants must choose one of the following (or any other fully open-source LLM with ≤ 8 B parameters):

| Model Name | Parameters | Notes |
| --- | --- | --- |
| Llama 3 | 1B, 3B, 8B | Meta's Llama series, particularly the smaller versions (the 1B and 3B checkpoints ship as Llama 3.2), is designed for efficiency and multilingual text generation. While the larger Llama models are more widely known, the 1B and 3B models offer a compact solution, and Meta has shown interest in addressing the linguistic diversity gap, which includes support for languages such as Sinhala and Tamil. |
| Gemma | 2B, 4B | Developed by Google DeepMind, Gemma models are lightweight yet powerful, with strong multilingual capabilities. Google has a strong focus on linguistic diversity, and Gemma's architecture makes it a good candidate for adapting to less-resourced languages. |
| Qwen-2 | 0.5B, 1.5B, 7B | This family of models from Alibaba is designed for efficiency and versatility. Their strong multilingual pretraining makes them good candidates for adaptation to Sinhala and Tamil through fine-tuning. |
| Microsoft Phi-3-Mini | 3.8B | Highlighted for strong reasoning and code-generation capabilities in a compact size. Although its focus is not explicitly on South Asian languages, its efficient design and good general language understanding make it a suitable base for fine-tuning on Sinhala and Tamil data. |

Or any other open-source checkpoint with ≤ 8 B parameters.

**Note:** Proprietary or closed-license models (e.g., the GPT-3 series, Claude) are **not** allowed.
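As a reference point, here is a minimal, hypothetical sketch of loading one of the allowed open checkpoints with the `transformers` library; the `Qwen/Qwen2-1.5B` model ID is only an example, and any other open-source checkpoint within the limit can be substituted.

```python
# Minimal sketch: load an allowed open-source base model for continual
# pre-training or fine-tuning. Qwen/Qwen2-1.5B is an example checkpoint;
# substitute any open model with <= 8 B parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(f"{model_id}: {model.num_parameters() / 1e9:.2f} B parameters")
```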
## 3. Data Resources and Evaluation

**Training Data (public)** (a loading sketch follows this list):

- Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
- Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
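The sketch below illustrates one way to pull public Sinhala and Tamil text with the `datasets` library. The dataset IDs and config names (`wikimedia/wikipedia` with `20231101.si` / `20231101.ta`) are assumptions based on commonly published Hub dumps; check the Hub for the exact corpora and licenses you intend to use.

```python
# Hypothetical sketch of streaming public Sinhala and Tamil Wikipedia text
# via the `datasets` library. Dataset IDs/configs are assumptions; verify
# the exact names and licenses on the Hugging Face Hub before training.
from datasets import load_dataset

sinhala = load_dataset("wikimedia/wikipedia", "20231101.si", split="train", streaming=True)
tamil = load_dataset("wikimedia/wikipedia", "20231101.ta", split="train", streaming=True)

for example in sinhala.take(3):
    print(example["text"][:200])
```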
**Evaluation:**

Your LLM will be evaluated using both intrinsic and extrinsic measures.

Intrinsic evaluation uses a perplexity score on held-out text.
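A minimal perplexity sketch, assuming PyTorch, a loaded `model`/`tokenizer` pair, and a hypothetical held-out file `eval.txt`; the single-window truncation here is a simplification, and longer evaluation sets are usually scored with a sliding window.

```python
# Minimal perplexity sketch (assumed setup: PyTorch + a loaded causal LM).
# "eval.txt" is a hypothetical held-out Sinhala or Tamil text file.
import math
import torch

def perplexity(model, tokenizer, path, max_length=1024):
    text = open(path, encoding="utf-8").read()
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # Causal-LM loss is mean negative log-likelihood per token; exp() gives perplexity.
    return math.exp(out.loss.item())

# print(perplexity(model, tokenizer, "eval.txt"))
```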
Extrinsic evaluation uses the appropriate MMLU metric. You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings.
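One common way to score an MMLU-style multiple-choice item in the zero-shot setting is to pick the option to which the model assigns the highest likelihood. The sketch below is a simplified illustration (it scores the whole prompt plus answer rather than only the answer tokens) and assumes a loaded causal LM; few-shot evaluation would simply prepend worked examples to the prompt.

```python
# Simplified zero-shot multiple-choice scoring sketch (assumed setup:
# PyTorch + a loaded causal LM and tokenizer). A common refinement scores
# only the answer tokens conditioned on the question.
import torch

def option_score(model, tokenizer, question, option):
    prompt = f"Question: {question}\nAnswer: {option}"
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # higher means the model finds this option more likely

def predict(model, tokenizer, question, options):
    scores = [option_score(model, tokenizer, question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```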
## 4. Submission Requirements

- **Model:** HuggingFace-format upload (an upload sketch follows this list).
- **Scripts and Notebooks:** Upload to a GitHub or HuggingFace repository.
- **Technical Report (2-5 pages):**
  - Training details: data sources, training mechanism, epochs, batch size, learning rates.
  - Resource usage: GPU time and a list of hardware resources.
  - Model evaluation.
  - Analysis of strengths and limitations.
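A minimal sketch of the HuggingFace-format upload, assuming the `transformers` library; the local checkpoint path and the repository name `your-team/sinhala-tamil-llm` are placeholders to be replaced with your own.

```python
# Minimal sketch: push a fine-tuned model and tokenizer to the Hugging Face
# Hub. Paths and the repo ID are placeholders; run `huggingface-cli login`
# (or set HF_TOKEN) beforehand.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./checkpoints/final")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/final")

model.push_to_hub("your-team/sinhala-tamil-llm")
tokenizer.push_to_hub("your-team/sinhala-tamil-llm")
```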
## 5. Rules & Fairness

- **Parameter Limit:** Strict upper bound of 8 B parameters, counting model plus adapter weights (a counting sketch follows this list).
- **Data Usage:** Only public, open-license data; no private data and no data scraped from behind a login.
- **Reproducibility:** All code, data-preparation scripts, and logs must be publicly accessible by the submission deadline.
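A small sketch of how the limit can be checked, counting base-model and adapter parameters together; `adapter_state` is a hypothetical dictionary of LoRA/adapter tensors and is optional.

```python
# Sketch: verify the 8 B parameter budget over base model + adapter weights.
# `adapter_state` is a hypothetical dict of adapter tensors (e.g. LoRA);
# pass None if the full model was fine-tuned directly.
def total_parameters(model, adapter_state=None):
    n = sum(p.numel() for p in model.parameters())
    if adapter_state is not None:
        n += sum(t.numel() for t in adapter_state.values())
    return n

# assert total_parameters(model, adapter_state=None) <= 8_000_000_000
```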
## 6. How to Register & Contact

- Registration Form: https://forms.gle/edzfpopVvKkkF6cH8
- Contact: [email protected]
- Phone: 076 981 1289