Update README.md
README.md
CHANGED
@@ -7,4 +7,63 @@ sdk: static
pinned: false
---
Shared Task Specification: “Small Models, Big Impact”
Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)
1. Task Overview & Objectives

Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continually pre-training or fine-tuning open-source models with ≤ 8 billion parameters.

Impact: Empower local NLP research and applications (chatbots, translation, sentiment analysis, educational tools) while lowering computational and storage barriers.

Who Should Participate:
- Students & Academic Teams: Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
- Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
2. Allowed Base Models

Participants must choose one of the following (or any other fully open-source LLM ≤ 8 B params); a minimal loading sketch follows the table:
| Model Name | Parameters | Notes |
|---|---|---|
| Llama 3 / 3.2 | 1B, 3B, 8B | Meta's Llama series, particularly the smaller versions, is designed for efficiency and multilingual text generation. While the larger Llama models are more widely known, the 1B and 3B (Llama 3.2) checkpoints offer a compact option. Meta has also shown interest in addressing the linguistic diversity gap, which includes support for languages like Sinhala and Tamil. |
| Gemma | 2B, 4B | Developed by Google DeepMind, Gemma models are known for being lightweight yet powerful, with strong multilingual capabilities. Google has a strong focus on linguistic diversity, and Gemma's architecture makes it a good candidate for adapting to less-resourced languages. |
| Qwen-2 | 0.5B, 1.5B, 7B | This family of models from Alibaba is designed for efficiency and versatility. Its strong multilingual pretraining makes these models good candidates for adaptation to Sinhala and Tamil through fine-tuning. |
| Microsoft Phi-3-Mini | 3.8B | This Microsoft model is notable for strong reasoning and code-generation capabilities in a compact size. Although its primary focus is not South Asian languages, its efficient design and good general language understanding make it a suitable base for fine-tuning with Sinhala and Tamil data. |

Or … any other open-source checkpoint ≤ 8 B params.
Note: Proprietary or closed-license models (e.g., GPT-3 series, Claude) are not allowed.
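The choice of base checkpoint is up to each team. As a minimal sketch (assuming the Hugging Face `transformers` library; the model ID and the test sentences are illustrative), loading a candidate model and checking how its stock tokenizer handles Sinhala and Tamil might look like this:

```python
# Minimal sketch (not a requirement): load an open-source base checkpoint
# within the 8 B parameter limit using the Hugging Face `transformers` library.
# The model ID is illustrative; any fully open-source checkpoint qualifies.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # illustrative choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Quick sanity check: how does the stock tokenizer segment Sinhala and Tamil text?
# Heavy over-segmentation hints that tokenizer/vocabulary adaptation may help.
for text in ["මෙය සිංහල වාක්‍යයකි.", "இது ஒரு தமிழ் வாக்கியம்."]:
    print(text, "->", len(tokenizer.tokenize(text)), "tokens")
```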
3. Data Resources and Evaluation

Training Data (public); a data-loading sketch follows this list:
- Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
- Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
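As a rough sketch of how such public corpora can be pulled with the Hugging Face `datasets` library (the dataset names, config strings, and snapshot date below are illustrative; check the current Hub identifiers before relying on them):

```python
# Rough sketch: load public Sinhala/Tamil corpora with the `datasets` library.
# Dataset names and configs are illustrative and may differ on the Hub.
from datasets import load_dataset

# Sinhala Wikipedia (Wikimedia dumps mirrored on the Hub; snapshot date is an example).
si_wiki = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")

# Tamil CC100 subset; streaming avoids downloading the full corpus up front.
ta_cc100 = load_dataset("cc100", lang="ta", split="train", streaming=True)

print(si_wiki[0]["text"][:200])
for i, example in enumerate(ta_cc100):
    print(example["text"][:200])
    if i >= 2:
        break
```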
Evaluation:

Your LLM will be evaluated using intrinsic and extrinsic measures as follows:
- Intrinsic evaluation using a perplexity score (see the perplexity sketch below).
- Extrinsic evaluation using the appropriate MMLU metric (see the multiple-choice scoring sketch below).

You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings.
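For the intrinsic measure, a minimal perplexity sketch is given below (the checkpoint and held-out file path are illustrative; chunking into fixed windows is a simplification of the usual strided sliding-window estimate):

```python
# Minimal sketch: corpus-level perplexity of a causal LM on held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = open("heldout_sinhala.txt", encoding="utf-8").read()  # illustrative path
ids = tokenizer(text, return_tensors="pt").input_ids[0]

window, total_nll, total_tokens = 1024, 0.0, 0
with torch.no_grad():
    for start in range(0, len(ids) - 1, window):
        chunk = ids[start:start + window].unsqueeze(0)
        if chunk.size(1) < 2:
            continue
        out = model(chunk, labels=chunk)   # labels=input_ids -> mean next-token NLL
        n = chunk.size(1) - 1              # number of predicted tokens in this chunk
        total_nll += out.loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```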
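For the extrinsic measure, a common zero-shot recipe scores each multiple-choice option by the log-likelihood the model assigns to it and predicts the highest-scoring option; few-shot evaluation simply prepends worked examples to the prompt. A hedged sketch (the question and choices are placeholders, not items from the task's MMLU set):

```python
# Sketch: zero-shot multiple-choice scoring by summed option log-likelihood.
# The example item is a placeholder; plug in the MMLU data provided by the task.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens.

    Assumes the prompt tokenization is a prefix of the full tokenization,
    which holds for most BPE tokenizers but is an approximation.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()  # keep only option positions

prompt = "ප්‍රශ්නය: ශ්‍රී ලංකාවේ අගනුවර කුමක්ද? පිළිතුර:"  # placeholder item
choices = ["කොළඹ", "ශ්‍රී ජයවර්ධනපුර කෝට්ටේ", "මහනුවර", "ගාල්ල"]
scores = [option_logprob(prompt, c) for c in choices]
print("predicted choice:", choices[scores.index(max(scores))])
```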
4. Submission Requirements

- Model: HuggingFace-format upload (see the upload sketch below).
- Scripts and Notebooks: Upload to a GitHub or HuggingFace repository.
- Technical Report (2-5 pages):
  - Training details: data sources, training mechanism, epochs, batch size, learning rates.
  - Resource usage: GPU time, list of hardware resources.
  - Model evaluation.
  - Analysis of strengths/limitations.
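As a hedged sketch of the HuggingFace-format upload (the local checkpoint path and repository name are placeholders; `push_to_hub` assumes you have run `huggingface-cli login` first):

```python
# Sketch: upload the final model and tokenizer in HuggingFace format.
# The local path and repository name below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./outputs/final_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("./outputs/final_checkpoint")

model.push_to_hub("your-org/compact-sinhala-tamil-llm")
tokenizer.push_to_hub("your-org/compact-sinhala-tamil-llm")
```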
5. Rules & Fairness

- Parameter Limit: Strict upper bound of 8 B parameters, counting model plus adapter weights (see the parameter-counting sketch below).
- Data Usage: Only public/open-license data; no private data and no content scraped from behind a login.
- Reproducibility: All code, data-preparation scripts, and logs must be publicly accessible by the submission deadline.
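Because the cap counts model plus adapter weights, a quick self-check for a LoRA/PEFT setup is to count both before submitting. A minimal sketch assuming the `peft` library (the base checkpoint and LoRA hyperparameters are illustrative):

```python
# Sketch: verify that base model + adapter weights stay under the 8 B cap.
# The checkpoint and LoRA settings are illustrative, not prescribed by the task.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")
base_params = sum(p.numel() for p in base.parameters())  # count before adding adapters

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
adapter_params = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)

total = base_params + adapter_params
print(f"base: {base_params/1e9:.2f} B, adapters: {adapter_params/1e6:.1f} M, total: {total/1e9:.2f} B")
assert total <= 8e9, "over the 8 B parameter limit"
```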
6. How to Register & Contact

Registration Form: https://forms.gle/edzfpopVvKkkF6cH8
Contact: [email protected]
Phone: 076 981 1289