---
title: README
emoji: 🐨
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---

Shared Task Specification: “Small Models, Big Impact”
Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)

1. Task Overview & Objectives
Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continually pre-training or fine-tuning open-source models with ≤ 8 billion parameters (a minimal training sketch follows this section).
Impact: Empower local NLP research and applications, such as chatbots, translation, sentiment analysis, and educational tools, while lowering computational and storage barriers.
Who Should Participate:
Students & Academic Teams: Showcase research on model adaptation, data augmentation, multilingual/multitask training.
Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
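
To make "continual pre-training or fine-tuning" concrete, here is a minimal sketch using the Hugging Face Trainer. The base-model id, corpus file, and hyperparameters are placeholders, not recommendations; adapt them to your own setup.

```python
# Minimal continual pre-training sketch with Hugging Face transformers.
# "open-source/base-llm-7b" and "sinhala_corpus.txt" are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "open-source/base-llm-7b"  # hypothetical base-model id
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many LLM tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Plain-text monolingual corpus, one document per line.
raw = load_dataset("text", data_files={"train": "sinhala_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="checkpoints",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=train_ds,
    # mlm=False gives standard causal (next-token) language modelling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```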

2. Allowed Base Models
Participants may use any fully open-source LLM with ≤ 8 B parameters as the base model.
Note: Proprietary or closed-license models (e.g., the GPT-3 series, Claude) are not allowed.

3. Data Resources and Evaluation
Training Data (public; see the loading sketch after this list):
Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
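
A minimal loading sketch, assuming the public OSCAR/CC100 releases on the Hugging Face Hub; verify the exact dataset and config names before training.

```python
# Load the public corpora with the Hugging Face `datasets` library.
# Config names follow the public OSCAR/CC100 releases; check them on the Hub.
from datasets import load_dataset

sinhala_oscar = load_dataset("oscar", "unshuffled_deduplicated_si", split="train")
tamil_oscar = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train")
tamil_cc100 = load_dataset("cc100", lang="ta", split="train")

print(sinhala_oscar[0]["text"][:200])  # peek at the first document
```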
Evaluation:
Your LLM will be evaluated using both intrinsic and extrinsic measures:
Intrinsic: perplexity on held-out text.
Extrinsic: accuracy on the provided MMLU dataset.
You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings (minimal sketches of both measures follow).
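
For intrinsic evaluation, a minimal perplexity sketch with transformers; the model id and evaluation text are placeholders.

```python
# Minimal perplexity sketch; scores one short passage. For long corpora,
# use a sliding window over fixed-length chunks.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/compact-sinhala-llm"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "..."  # held-out Sinhala or Tamil evaluation text
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # With labels == input_ids, the model returns the mean next-token
    # cross-entropy loss; exponentiating it gives perplexity.
    loss = model(input_ids, labels=input_ids).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```

For extrinsic evaluation, MMLU-style accuracy is commonly computed by scoring each answer choice's log-likelihood under the model and picking the highest. A zero-shot sketch, with the question, options, and model id as placeholders:

```python
# Zero-shot multiple-choice sketch: pick the option whose tokens the model
# assigns the highest log-likelihood, given the question as a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/compact-tamil-llm"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Log-likelihood of the option tokens, conditioned on the prompt.

    Assumes the prompt's tokenization is a prefix of the full sequence's,
    which holds for most tokenizers but is worth checking.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    return sum(log_probs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

question = "..."                        # MMLU-style question
options = ["...", "...", "...", "..."]  # four answer choices
prompt = f"Question: {question}\nAnswer:"
scores = [option_logprob(prompt, o) for o in options]
print("Predicted option:", "ABCD"[scores.index(max(scores))])
```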

4. Submission Requirements
Model: uploaded to the Hugging Face Hub in the standard Transformers format (see the sketch after this list).
Scripts and Notebooks: uploaded to a GitHub or Hugging Face repository.
Technical Report (2-5 pages):
Training details: data sources, training procedure, epochs, batch size, learning rates.
Resource usage: GPU time and a list of hardware resources.
Model evaluation results.
Analysis of strengths and limitations.
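
A minimal upload sketch using the Transformers push_to_hub API; the checkpoint path and repository id are placeholders.

```python
# Push the trained model and tokenizer to the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/compact-sinhala-llm"  # hypothetical repository id

model = AutoModelForCausalLM.from_pretrained("./checkpoints/final")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/final")

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```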


5. Rules & Fairness
Parameter Limit: Strict upper bound of 8 B parameters, counting base-model and adapter weights together (see the sketch after this list).
Data Usage: Only public, open-license data; no private data or content scraped from behind a login.
Reproducibility: All code, data-prep scripts, and logs must be publicly accessible by the submission deadline.
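
A minimal sketch for checking the parameter budget. Load the model exactly as you would submit it (base weights plus any merged or attached adapters) so the count covers everything; the model id is a placeholder.

```python
# Count all trained weights against the 8 B parameter limit.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/compact-tamil-llm")
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")
assert total <= 8_000_000_000, "exceeds the 8 B parameter limit"
```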

6. How to Register & Contact
Registration Form: https://forms.gle/edzfpopVvKkkF6cH8
Contact: [email protected]
Phone: 076 981 1289