Update README.md
README.md
CHANGED
@@ -7,4 +7,63 @@ sdk: static
pinned: false
---
Shared Task Specification: “Small Models, Big Impact”
Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)
1. Task Overview & Objectives

Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil by continually pre-training or fine-tuning open-source models with ≤ 8 billion parameters.

Impact: Empower local NLP research and applications (chatbots, translation, sentiment analysis, educational tools) while lowering computational and storage barriers.

Who Should Participate:
- Students & Academic Teams: Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
- Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
2. Allowed Base Models

Participants must choose one of the following (or any other fully open-source LLM ≤ 8 B params); a minimal loading sketch follows the table:
| Model Name | Parameters | Notes |
|---|---|---|
| Llama 3 / 3.2 | 1B, 3B, 8B | Meta's Llama series, particularly the smaller versions, is designed for efficiency and multilingual text generation. While the larger Llama models are more widely known, the 1B and 3B (Llama 3.2) checkpoints offer a compact option. Meta has also shown interest in addressing the linguistic diversity gap, which includes support for languages like Sinhala and Tamil. |
| Gemma | 2B, 4B | Developed by Google DeepMind, Gemma models are known for being lightweight yet powerful, with strong multilingual capabilities. Google has a strong focus on linguistic diversity, and Gemma's architecture makes it a good candidate for adapting to less-resourced languages. |
| Qwen-2 | 0.5B, 1.5B, 7B | This family of models from Alibaba is designed for efficiency and versatility. Its strong multilingual pretraining makes these models good candidates for adaptation to Sinhala and Tamil through fine-tuning. |
| Microsoft Phi-3-Mini | 3.8B | This Microsoft model is notable for strong reasoning and code-generation capabilities in a compact size. Although its primary focus is not South Asian languages, its efficient design and good general language understanding make it a suitable base for fine-tuning with Sinhala and Tamil data. |

Or … any other open-source checkpoint ≤ 8 B params.
Note: Proprietary or closed-license models (e.g., GPT-3 series, Claude) are not allowed.
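The choice of base checkpoint is up to each team. As a minimal sketch (assuming the Hugging Face `transformers` library; the model ID and the test sentences are illustrative), loading a candidate model and checking how its stock tokenizer handles Sinhala and Tamil might look like this:

```python
# Minimal sketch (not a requirement): load an open-source base checkpoint
# within the 8 B parameter limit using the Hugging Face `transformers` library.
# The model ID is illustrative; any fully open-source checkpoint qualifies.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # illustrative choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Quick sanity check: how does the stock tokenizer segment Sinhala and Tamil text?
# Heavy over-segmentation hints that tokenizer/vocabulary adaptation may help.
for text in ["මෙය සිංහල වාක්‍යයකි.", "இது ஒரு தமிழ் வாக்கியம்."]:
    print(text, "->", len(tokenizer.tokenize(text)), "tokens")
```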
3. Data Resources and Evaluation

Training Data (public); a data-loading sketch follows this list:
- Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
- Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
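As a rough sketch of how such public corpora can be pulled with the Hugging Face `datasets` library (the dataset names, config strings, and snapshot date below are illustrative; check the current Hub identifiers before relying on them):

```python
# Rough sketch: load public Sinhala/Tamil corpora with the `datasets` library.
# Dataset names and configs are illustrative and may differ on the Hub.
from datasets import load_dataset

# Sinhala Wikipedia (Wikimedia dumps mirrored on the Hub; snapshot date is an example).
si_wiki = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")

# Tamil CC100 subset; streaming avoids downloading the full corpus up front.
ta_cc100 = load_dataset("cc100", lang="ta", split="train", streaming=True)

print(si_wiki[0]["text"][:200])
for i, example in enumerate(ta_cc100):
    print(example["text"][:200])
    if i >= 2:
        break
```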
Evaluation:

Your LLM will be evaluated using intrinsic and extrinsic measures as follows:
- Intrinsic evaluation using a perplexity score (see the perplexity sketch below).
- Extrinsic evaluation using the appropriate MMLU metric (see the multiple-choice scoring sketch below).

You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings.
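For the intrinsic measure, a minimal perplexity sketch is given below (the checkpoint and held-out file path are illustrative; chunking into fixed windows is a simplification of the usual strided sliding-window estimate):

```python
# Minimal sketch: corpus-level perplexity of a causal LM on held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = open("heldout_sinhala.txt", encoding="utf-8").read()  # illustrative path
ids = tokenizer(text, return_tensors="pt").input_ids[0]

window, total_nll, total_tokens = 1024, 0.0, 0
with torch.no_grad():
    for start in range(0, len(ids) - 1, window):
        chunk = ids[start:start + window].unsqueeze(0)
        if chunk.size(1) < 2:
            continue
        out = model(chunk, labels=chunk)   # labels=input_ids -> mean next-token NLL
        n = chunk.size(1) - 1              # number of predicted tokens in this chunk
        total_nll += out.loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```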
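For the extrinsic measure, a common zero-shot recipe scores each multiple-choice option by the log-likelihood the model assigns to it and predicts the highest-scoring option; few-shot evaluation simply prepends worked examples to the prompt. A hedged sketch (the question and choices are placeholders, not items from the task's MMLU set):

```python
# Sketch: zero-shot multiple-choice scoring by summed option log-likelihood.
# The example item is a placeholder; plug in the MMLU data provided by the task.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens.

    Assumes the prompt tokenization is a prefix of the full tokenization,
    which holds for most BPE tokenizers but is an approximation.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()  # keep only option positions

prompt = "ප්‍රශ්නය: ශ්‍රී ලංකාවේ අගනුවර කුමක්ද? පිළිතුර:"  # placeholder item
choices = ["කොළඹ", "ශ්‍රී ජයවර්ධනපුර කෝට්ටේ", "මහනුවර", "ගාල්ල"]
scores = [option_logprob(prompt, c) for c in choices]
print("predicted choice:", choices[scores.index(max(scores))])
```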
4. Submission Requirements

- Model: HuggingFace-format upload (see the upload sketch below).
- Scripts and Notebooks: Upload to a GitHub or HuggingFace repository.
- Technical Report (2-5 pages):
  - Training details: data sources, training mechanism, epochs, batch size, learning rates.
  - Resource usage: GPU time, list of hardware resources.
  - Model evaluation.
  - Analysis of strengths/limitations.
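As a hedged sketch of the HuggingFace-format upload (the local checkpoint path and repository name are placeholders; `push_to_hub` assumes you have run `huggingface-cli login` first):

```python
# Sketch: upload the final model and tokenizer in HuggingFace format.
# The local path and repository name below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./outputs/final_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("./outputs/final_checkpoint")

model.push_to_hub("your-org/compact-sinhala-tamil-llm")
tokenizer.push_to_hub("your-org/compact-sinhala-tamil-llm")
```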
5. Rules & Fairness

- Parameter Limit: Strict upper bound of 8 B parameters, counting model plus adapter weights (see the parameter-counting sketch below).
- Data Usage: Only public/open-license data; no private data and no content scraped from behind a login.
- Reproducibility: All code, data-preparation scripts, and logs must be publicly accessible by the submission deadline.
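Because the cap counts model plus adapter weights, a quick self-check for a LoRA/PEFT setup is to count both before submitting. A minimal sketch assuming the `peft` library (the base checkpoint and LoRA hyperparameters are illustrative):

```python
# Sketch: verify that base model + adapter weights stay under the 8 B cap.
# The checkpoint and LoRA settings are illustrative, not prescribed by the task.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")
base_params = sum(p.numel() for p in base.parameters())  # count before adding adapters

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
adapter_params = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)

total = base_params + adapter_params
print(f"base: {base_params/1e9:.2f} B, adapters: {adapter_params/1e6:.1f} M, total: {total/1e9:.2f} B")
assert total <= 8e9, "over the 8 B parameter limit"
```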
6. How to Register & Contact

Registration Form: https://forms.gle/edzfpopVvKkkF6cH8
Contact: [email protected]
Phone: 076 981 1289