---
title: README
emoji: 🐨
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---

# Shared Task Specification: “Small Models, Big Impact”

**Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)**
## 1. Task Overview & Objectives

**Goal:** foster the development of compact, high-quality LLMs for Sinhala and Tamil by continually pre-training or fine-tuning open-source models with ≤ 8 billion parameters (a minimal training sketch is given at the end of this section).

**Impact:** empower local NLP research and applications, such as chatbots, translation, sentiment analysis, and educational tools, while lowering computational and storage barriers.

**Who should participate:**

- **Students & academic teams:** showcase research on model adaptation, data augmentation, and multilingual/multitask training.
- **Industry & startups:** demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
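
As a reference point for the continual pre-training route described above, the sketch below continues causal-LM training on an open base model with the HuggingFace `transformers` Trainer. The model id, corpus file, and all hyperparameters are placeholders rather than recommendations.

```python
# Minimal continual pre-training sketch (all settings are placeholders).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "your-org/open-base-model"  # any fully open-source LLM with <= 8 B params

tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:          # many causal-LM tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Placeholder corpus: one document per line; swap in your Sinhala/Tamil data.
raw = load_dataset("text", data_files={"train": "sinhala_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="ckpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    # mlm=False selects the plain causal-LM (next-token) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```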
## 2. Allowed Base Models

Participants must choose one of the following (or any other fully open-source LLM with ≤ 8 B parameters).

**Note:** proprietary or closed-license models (e.g., the GPT-3 series, Claude) are not allowed.
## 3. Data Resources and Evaluation

**Training data (public):**

- Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
- Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
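
Most of these corpora are mirrored on the HuggingFace hub; the sketch below shows one way to pull the Wikipedia dumps with the `datasets` library. The dataset ids and snapshot date are our assumptions, so check the hub for the organisers' exact releases.

```python
from datasets import load_dataset

# Wikipedia dumps via the wikimedia mirror (snapshot date is a placeholder;
# check the hub for the latest available one).
wiki_si = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")
wiki_ta = load_dataset("wikimedia/wikipedia", "20231101.ta", split="train")

# OSCAR and CC100 subsets are also on the hub (e.g. the `oscar` and `cc100`
# datasets); their older script-based versions may need trust_remote_code=True
# or a pinned `datasets` version.
print(wiki_si[0]["text"][:200])
```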
**Evaluation:** your LLM will be evaluated using both intrinsic and extrinsic measures:

- Intrinsic evaluation: perplexity on held-out text.
- Extrinsic evaluation: accuracy on the MMLU benchmark.

You may use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings (see the evaluation sketch below).
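
To make the two measures concrete, here is a minimal evaluation sketch: perplexity computed as the exponential of the mean token-level negative log-likelihood, and zero-shot multiple-choice scoring of the kind commonly used for MMLU (ranking answer options by log-likelihood). The model id is a placeholder, and the organisers' exact protocol may differ.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-compact-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """exp(mean NLL per token) on one text; chunk long texts in practice."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

@torch.no_grad()
def score_choice(question: str, choice: str) -> float:
    """Sum of log-probs of the answer tokens given the question (zero-shot)."""
    prompt_ids = tokenizer(question, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(question + " " + choice, return_tensors="pt")["input_ids"]
    logits = model(input_ids=full_ids).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the answer tokens, each predicted from the preceding position.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

def answer(question: str, choices: list[str]) -> int:
    """Index of the highest-scoring option; compare against the gold label."""
    return max(range(len(choices)), key=lambda i: score_choice(question, choices[i]))
```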
## 4. Submission Requirements

- **Model:** uploaded in HuggingFace format (see the upload sketch after this list).
- **Scripts and notebooks:** uploaded to a GitHub or HuggingFace repository.
- **Technical report (2-5 pages), covering:**
  - Training details: data sources, training mechanism, epochs, batch size, learning rates.
  - Resource usage: GPU time and the hardware used.
  - Model evaluation.
  - Analysis of strengths and limitations.
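
For the model upload, saving the final checkpoint in standard HuggingFace format and pushing it to the hub is sufficient. A minimal sketch follows; the repo id is a placeholder, and `push_to_hub` assumes you are already authenticated via `huggingface-cli login` or an HF token.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ckpt")  # your final checkpoint
tokenizer = AutoTokenizer.from_pretrained("ckpt")

model.push_to_hub("your-org/sinhala-tamil-llm")       # placeholder repo id
tokenizer.push_to_hub("your-org/sinhala-tamil-llm")
```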
## 5. Rules & Fairness

- **Parameter limit:** strict upper bound of 8 B parameters, counting model plus adapter weights (see the check after this list).
- **Data usage:** only public, open-license data; no private data or content scraped from behind a login.
- **Reproducibility:** all code, data-preparation scripts, and logs must be publicly accessible by the submission deadline.
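
As a quick self-check against the 8 B bound, the sketch below counts every weight in a loaded model; in a PEFT/LoRA setup, loading the adapter first makes its weights part of the same sum. The repo id is a placeholder.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/sinhala-tamil-llm")  # placeholder

# Total = base weights + any adapter weights already loaded into the model.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")
assert total <= 8_000_000_000, "over the 8 B parameter limit"
```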
## 6. How to Register & Contact

- Registration form: https://forms.gle/edzfpopVvKkkF6cH8
- Contact: [email protected]
- Phone: 076 981 1289