base_model: meta-llama/Meta-Llama-3-8B
library_name: peft

Model Card for SinLlama

SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10.7M-sentence Sinhala corpus. SinLlama significantly improves coverage and performance on Sinhala NLP tasks compared to the base and instruct versions of Llama-3-8B.

DISCLAIMER: This is a base model that has NOT been instruction-tuned; task-specific fine-tuning is still required before use.

Model Details

Model Description

SinLlama is a decoder-only large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by ~20 million people in Sri Lanka. The model was developed by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.

Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models.

  • Developed by: H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
  • Funded by: CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
  • Shared by: Polyglots team
  • Model type: Decoder-only autoregressive transformer LLM
  • Language(s) (NLP): Sinhala (සිංහල)
  • License: Same as base model (Meta Llama 3 license)
  • Finetuned from model: meta-llama/Meta-Llama-3-8B

Model Sources

  • Repository: https://huggingface.co/polyglots/SinLlama_v01
  • Paper: https://arxiv.org/abs/2508.09115

[Figure: SinLlama model creation process]

Uses

Downstream Use

  • Instruction tuning for Sinhala dialogue systems, text classification, and other downstream tasks
  • Cross-lingual applications involving Sinhala
  • Educational and research applications in low-resource NLP

Out-of-Scope Use

  • Applications requiring high accuracy in non-Sinhala languages (performance may degrade due to adaptation focus on Sinhala)
  • Sensitive domains (e.g., healthcare, legal) without rigorous validation
  • Malicious generation (hate speech, disinformation)

Bias, Risks, and Limitations

  • Bias: Sinhala corpora may reflect sociocultural biases (e.g., political, gender, religious biases).
  • Limitations: Model may underperform in complex reasoning tasks or in languages other than Sinhala. Writing-style classification is observed as particularly challenging.
  • Risk: Misuse in spreading misinformation or biased outputs in Sinhala.

Recommendations

Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task/domain-specific Sinhala data is required for robustness.


How to Get Started with the Model

Install dependencies

!pip install unsloth
!pip install datasets==2.21.0
!pip install pandas==2.1.4

Import dependencies

from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TextStreamer, AutoTokenizer
import torch
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset
from collections import Counter, defaultdict
import os
import sys

from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
import pandas as pd

Load the base model

max_seq_length = 2048  # Choose any; Unsloth supports RoPE scaling internally
dtype = None           # None for auto detection; float16 for Tesla T4/V100, bfloat16 for Ampere+
load_in_4bit = False   # Set True to use 4-bit quantization and reduce memory usage
model_name = "polyglots/SinLlama_v01"

Load the model

model, _ = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    resize_model_vocab=139336 # Size of new vocab
)

Load our extended tokenizer

tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")
model.resize_token_embeddings(len(tokenizer))
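
Run a quick check and generation

Since this is a base (non-instruct) model, generation simply continues the prompt. Continuing from the code above, the following minimal sketch first checks the effect of the vocabulary extension against the base Llama-3 tokenizer and then streams a short continuation; the Sinhala prompt and generation settings are illustrative only, not from the SinLlama authors:

# Compare tokenization of a Sinhala phrase with the base Llama-3 tokenizer
base_tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b")
sample = "ශ්‍රී ලංකාව"  # "Sri Lanka"
print("base tokenizer:", len(base_tokenizer.tokenize(sample)), "tokens")
print("extended tokenizer:", len(tokenizer.tokenize(sample)), "tokens")

# Switch Unsloth to inference mode and stream a short continuation of the prompt
FastLanguageModel.for_inference(model)
inputs = tokenizer(sample, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)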

Training Details

Training Data

  • Pretraining: 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
  • Fine-tuning:
    • Sentiment Analysis (~12.5K samples)
    • Writing Style Classification (~9K samples)
    • Sinhala News Category Classification (~3.3K samples)

Training Procedure

  • Tokenizer: Extended Llama-3 tokenizer with Sinhala-specific tokens using tiktoken.
  • Continual Pretraining: Using the Chinese-Llama codebase, with the block size reduced from 1024 → 512 for GPU compatibility.
  • Fine-tuning: LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (a minimal sketch follows).
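
The sketch below illustrates this recipe with the Unsloth/TRL stack already imported in the getting-started section. The LoRA rank and target modules, the training hyperparameters, the prompt template, and the instruction/input/output column names are illustrative assumptions rather than the authors' exact settings; train_dataset stands for a Hugging Face Dataset with those columns, and the SFTTrainer call assumes a TRL version that still accepts tokenizer, dataset_text_field, and max_seq_length directly:

# Attach LoRA adapters to the loaded SinLlama model (values are illustrative)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Alpaca-style prompt template used to turn classification examples into text
alpaca_prompt = (
    "Below is an instruction that describes a task, paired with an input. "
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)

def to_text(batch):
    # Assumes the fine-tuning dataset has instruction/input/output columns
    texts = [alpaca_prompt.format(ins, inp, out) + tokenizer.eos_token
             for ins, inp, out in zip(batch["instruction"], batch["input"], batch["output"])]
    return {"text": texts}

train_dataset = train_dataset.map(to_text, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()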

Training Hyperparameters

  • Mixed precision (fp16/bf16) training
  • LoRA adapters for efficient fine-tuning

Evaluation

Testing Data

  • Sinhala sentiment, writing style, and news categorization datasets.
  • Splits: 80/10/10 (train/validation/test) with stratified sampling (see the splitting sketch below).
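
A minimal sketch of such an 80/10/10 stratified split with the datasets library; the CSV file name and the "label" column are hypothetical placeholders, not the authors' actual preprocessing code:

from datasets import load_dataset, DatasetDict

# Hypothetical local CSV with "text" and "label" columns
ds = load_dataset("csv", data_files="sinhala_news.csv")["train"]
ds = ds.class_encode_column("label")  # stratify_by_column requires a ClassLabel feature

# Carve off 20%, then split it half-and-half into validation and test
split = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
heldout = split["test"].train_test_split(test_size=0.5, stratify_by_column="label", seed=42)

dataset = DatasetDict({
    "train": split["train"],         # 80%
    "validation": heldout["train"],  # 10%
    "test": heldout["test"],         # 10%
})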

Metrics

  • Precision, Recall, F1-score
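
For reference, a tiny scikit-learn sketch of computing these metrics from gold and predicted labels; the macro averaging and the toy label lists are assumptions, not the paper's exact evaluation script:

from sklearn.metrics import precision_recall_fscore_support

y_true = ["news", "sports", "news", "opinion"]  # gold labels (toy example)
y_pred = ["news", "news", "news", "opinion"]    # predicted labels (toy example)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision={precision:.4f}  Recall={recall:.4f}  F1={f1:.4f}")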

Results

| Model                           | Writing Style F1 | News F1 | Sentiment F1 |
|---------------------------------|------------------|---------|--------------|
| Llama-3-8B base                 | 24.50            | 19.03   | 36.29        |
| Llama-3-8B base, fine-tuned     | 49.45            | 61.14   | 59.35        |
| Llama-3-8B instruct, fine-tuned | 42.25            | 47.81   | 68.78        |
| SinLlama, fine-tuned            | 58.89            | 86.40   | 72.47        |

Summary: SinLlama outperforms both base and instruct Llama-3-8B when fine-tuned, especially in news categorization and sentiment tasks.


Environmental Impact

  • Hardware Type: GPUs (not specified, likely A100-class)
  • Hours used: Not reported
  • Cloud Provider: CSIR & Emojot infrastructure
  • Compute Region: India & Sri Lanka
  • Carbon Emitted: Not reported

Technical Specifications

Model Architecture and Objective

  • Decoder-only transformer (Llama-3-8B backbone)
  • Autoregressive pretraining objective
  • Sinhala vocabulary-extended tokenizer

Compute Infrastructure

  • Hardware: GPUs provided by CSIR-CSIO and Emojot
  • Software: Hugging Face transformers, PEFT, LoRA, tiktoken

Citation

BibTeX:

@article{aravinda2025sinllama,
  title={SinLlama -- A Large Language Model for Sinhala},
  author={Aravinda, H W K and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
  journal={arXiv preprint arXiv:2508.09115},
  year={2025}
}

APA: Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). SinLlama -- A Large Language Model for Sinhala. arXiv preprint arXiv:2508.09115.


Model Card Authors

  • Based on information from the SinLlama authors

Model Card Contact

Framework versions

  • PEFT 0.13.2
  • Transformers (latest at time of release)