Nevidu committed · Commit 5169989 · verified · 1 Parent(s): 942b8a8

Update README.md

Files changed (1): README.md +60 -1
README.md CHANGED
@@ -7,4 +7,63 @@ sdk: static
  pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.
+ Shared Task Specification: “Small Models, Big Impact”
+ Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)
+
+ 1. Task Overview & Objectives
+ Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil through continual pre-training or fine-tuning of open-source models with ≤ 8 billion parameters. A minimal fine-tuning sketch follows at the end of this section.
+ Impact: Empower local NLP research and applications, such as chatbots, translation, sentiment analysis, and educational tools, while lowering computational and storage barriers.
+ Who Should Participate:
+ Students & Academic Teams: Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
+ Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
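+
+ For reference, a minimal continual fine-tuning sketch using the Hugging Face transformers, peft, and datasets libraries is shown below. The base checkpoint, the corpus file sinhala_corpus.txt, and all hyperparameters are placeholder assumptions, not task requirements.
+
+ ```python
+ # Minimal LoRA fine-tuning sketch. Assumes transformers, peft, and datasets
+ # are installed; "sinhala_corpus.txt" is a hypothetical plain-text corpus.
+ from datasets import load_dataset
+ from peft import LoraConfig, get_peft_model
+ from transformers import (AutoModelForCausalLM, AutoTokenizer,
+                           DataCollatorForLanguageModeling, Trainer,
+                           TrainingArguments)
+
+ base = "Qwen/Qwen2-0.5B"  # example only; any open checkpoint <= 8 B params
+ tokenizer = AutoTokenizer.from_pretrained(base)
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+ model = AutoModelForCausalLM.from_pretrained(base)
+
+ # Wrap the base model with small trainable LoRA adapters.
+ model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
+                                          task_type="CAUSAL_LM"))
+
+ # Drop empty lines, then tokenise the raw corpus for causal-LM training.
+ dataset = load_dataset("text", data_files={"train": "sinhala_corpus.txt"})["train"]
+ dataset = dataset.filter(lambda ex: ex["text"].strip())
+ dataset = dataset.map(
+     lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
+     batched=True, remove_columns=["text"])
+
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
+                            num_train_epochs=1, learning_rate=2e-4),
+     train_dataset=dataset,
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
+ trainer.train()
+ ```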
+
+ 2. Allowed Base Models
+ Participants must choose one of the following (or any other fully open-source LLM ≤ 8 B params):
+
+ | Model Name | Parameters | Notes |
+ | --- | --- | --- |
+ | Llama 3 | 1B, 3B, 7B | Meta's Llama series, particularly the smaller versions, is designed for efficiency and multilingual text generation. While the larger Llama models are more widely known, the 1B and 3B models offer a compact option. Meta has also shown interest in closing the linguistic-diversity gap, including support for languages like Sinhala and Tamil. |
+ | Gemma | 2B, 4B | Developed by Google DeepMind, Gemma models are known for being lightweight yet powerful, with strong multilingual capabilities. Google has a strong focus on linguistic diversity, and Gemma's architecture makes it a good candidate for adaptation to less-resourced languages. |
+ | Qwen-2 | 0.5B, 1.5B, 7B | This family of models from Alibaba is designed for efficiency and versatility. Strong multilingual pretraining makes these models good candidates for adaptation to Sinhala and Tamil through fine-tuning. |
+ | Microsoft Phi-3-Mini | 3.8B | This model from Microsoft offers strong reasoning and code-generation capabilities in a compact size. Although its primary focus is not South Asian languages, its efficient design and good general language understanding make it a suitable base for fine-tuning with Sinhala and Tamil data. |
+
+ Or … any other open-source checkpoint ≤ 8 B params.
+
+ Note: Proprietary or closed-license models (e.g., the GPT-3 series, Claude) are not allowed.
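+
+ Any of these checkpoints can be pulled directly from the Hugging Face Hub. A quick smoke test is sketched below; the Qwen2 checkpoint is just one ungated example, not a recommendation.
+
+ ```python
+ # Quick smoke test: load one allowed checkpoint and generate a few tokens.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "Qwen/Qwen2-1.5B"  # example; any open checkpoint <= 8 B params
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint)
+
+ inputs = tokenizer("ශ්‍රී ලංකාව", return_tensors="pt")  # Sinhala for "Sri Lanka"
+ output = model.generate(**inputs, max_new_tokens=20)
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```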
+
+ 3. Data Resources and Evaluation
+ Training Data (public):
+ Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
+ Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
+ Evaluation:
+ Your LLM will be evaluated using intrinsic and extrinsic measures, as follows:
+ Intrinsic evaluation using perplexity (a sketch follows after this list).
+ Extrinsic evaluation using the MMLU benchmark (reported as accuracy).
+ You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings.
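+
+ A common way to compute corpus-level perplexity with transformers is to exponentiate the mean token negative log-likelihood over a held-out file, scored in fixed-size chunks. A minimal sketch follows; the checkpoint and the file sinhala_eval.txt are placeholders.
+
+ ```python
+ # Perplexity sketch: exp(mean token NLL) over a held-out text file.
+ # "sinhala_eval.txt" is a hypothetical evaluation file.
+ import math
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "Qwen/Qwen2-0.5B"  # placeholder: use your submitted model
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).eval()
+
+ text = open("sinhala_eval.txt", encoding="utf-8").read()
+ ids = tokenizer(text, return_tensors="pt").input_ids
+
+ nll, n_tokens, stride = 0.0, 0, 512
+ with torch.no_grad():
+     for start in range(0, ids.size(1) - 1, stride):
+         chunk = ids[:, start : start + stride + 1]
+         # The model shifts labels internally; .loss is the mean token NLL.
+         loss = model(chunk, labels=chunk).loss
+         nll += loss.item() * (chunk.size(1) - 1)
+         n_tokens += chunk.size(1) - 1
+
+ print("perplexity:", math.exp(nll / n_tokens))
+ ```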
+
+ 4. Submission Requirements
+ Model: HuggingFace-format upload (a sketch follows after this list).
+ Scripts and Notebooks: should be uploaded to a GitHub or HuggingFace repository.
+ Technical Report (2–5 pages):
+ Training details: data sources, training mechanism, epochs, batch size, learning rates.
+ Resource usage: GPU time, list of hardware resources.
+ Model evaluation.
+ Analysis of strengths/limitations.
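+
+ Uploading in HuggingFace format can be done directly from transformers once you are authenticated (for example via huggingface-cli login). A minimal sketch follows; the output directory "out" and the repository id are placeholders.
+
+ ```python
+ # Push a trained model and tokenizer to the Hugging Face Hub.
+ # "your-org/sinhala-tamil-llm" is a placeholder repository id.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("out")  # your fine-tuned checkpoint
+ tokenizer = AutoTokenizer.from_pretrained("out")
+
+ model.push_to_hub("your-org/sinhala-tamil-llm")
+ tokenizer.push_to_hub("your-org/sinhala-tamil-llm")
+ ```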
+
+ 5. Rules & Fairness
+ Parameter Limit: strict upper bound of 8 B parameters (model + adapter weights); a verification sketch follows after this list.
+ Data Usage: only public/open-license data; no private data or data scraped from behind a login.
+ Reproducibility: all code, data-prep scripts, and logs must be publicly accessible by the submission deadline.
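+
+ One way to check the 8 B budget, counting base-model and adapter weights together, is sketched below; both checkpoint ids are placeholders.
+
+ ```python
+ # Count every parameter in the combined (base + LoRA adapter) model
+ # against the 8 B cap. Both ids below are placeholders.
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM
+
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
+ model = PeftModel.from_pretrained(base, "your-org/lora-adapter")
+
+ total = sum(p.numel() for p in model.parameters())
+ print(f"total parameters: {total:,}")
+ assert total <= 8_000_000_000, "over the 8 B parameter limit"
+ ```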
+
+ 6. How to Register & Contact
+ Registration Form: https://forms.gle/edzfpopVvKkkF6cH8
+ Contact: [email protected]
+ Phone: 076 981 1289