---
language:
  - lg
  - en
library_name: unsloth
pipeline_tag: text-generation
license: gemma
base_model: unsloth/gemma-2-2b-it
tags:
  - luganda
  - gemma
  - pretrained
  - wikipedia
  - unsloth
datasets:
  - wikimedia/wikipedia
---

Gemma-2-2b-it Pretrained for Luganda

Model Description

This model is the result of continued pretraining of Gemma-2-2b-it on Luganda text data. The base model was further trained on Luganda Wikipedia articles to adapt it for Luganda language understanding and generation.

Model Details

  • Base Model: unsloth/gemma-2-2b-it
  • Pretraining Data:
    • Luganda Wikipedia articles (wikimedia/wikipedia 20231101.lg)
  • Training Method: LoRA with unsloth optimization
  • Context Length: 2048 tokens
  • Training Hardware: Tesla T4 GPU

Training Process

The model was trained using the following configuration (illustrative code sketches follow each parameter list):

LoRA Configuration

  • LoRA rank (r): 128
  • Target modules:
    • q_proj, k_proj, v_proj, o_proj
    • gate_proj, up_proj, down_proj
    • embed_tokens, lm_head
  • LoRA alpha: 32
  • LoRA dropout: 0
  • Used RS-LoRA (Rank Stabilized LoRA)
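
The settings above correspond roughly to unsloth's FastLanguageModel.get_peft_model call. The snippet below is an illustrative sketch only, not the exact training script; it assumes model has already been loaded from unsloth/gemma-2-2b-it as shown in the Usage section.

from unsloth import FastLanguageModel

# Illustrative sketch: attach the LoRA adapters described above.
# `model` is assumed to come from FastLanguageModel.from_pretrained(
#     "unsloth/gemma-2-2b-it", max_seq_length=2048, ...).
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,                         # LoRA rank
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",   # included so embeddings adapt to Luganda
    ],
    lora_alpha = 32,
    lora_dropout = 0,
    use_rslora = True,               # Rank Stabilized LoRA
)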

Training Parameters

  • Batch size: 2 per device, with gradient accumulation steps of 8 (effective batch size 16)
  • Learning rates:
    • General: 5e-5
    • Embeddings: 1e-6 (reduced for stability)
  • Training epochs: 10
  • Warmup steps: 10
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Optimizer: AdamW 8-bit
  • LR scheduler: Linear
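
These hyperparameters map naturally onto unsloth's continued-pretraining trainer, which accepts a separate embedding learning rate. The following is an illustrative sketch (assuming model, tokenizer, and a formatted dataset prepared as described on this card), not the exact script that was used:

from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-6,   # reduced LR for embed_tokens / lm_head
        num_train_epochs = 10,
        warmup_steps = 10,
        warmup_ratio = 0.1,
        weight_decay = 0.01,
        optim = "adamw_8bit",
        lr_scheduler_type = "linear",
        output_dir = "outputs",           # assumed output path
    ),
)
trainer.train()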

Data Processing

The training data was processed using the following template (a formatting sketch follows the template):

Ekyawandiikibwa kya Wikipedia
### Omutwe: {title}

### Akawayiro:
{text}
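
A plausible way to reproduce this formatting with the datasets library is sketched below. The title and text columns are the standard wikimedia/wikipedia fields; the tokenizer is assumed to be loaded as in the Usage section, and appending its EOS token between articles is an assumption rather than something stated above.

from datasets import load_dataset

# Load the Luganda Wikipedia dump listed under Model Details.
dataset = load_dataset("wikimedia/wikipedia", "20231101.lg", split = "train")

wikipedia_prompt = """Ekyawandiikibwa kya Wikipedia
### Omutwe: {}

### Akawayiro:
{}"""

def format_examples(examples):
    # Apply the template to every article; EOS marks article boundaries.
    texts = [
        wikipedia_prompt.format(title, text) + tokenizer.eos_token
        for title, text in zip(examples["title"], examples["text"])
    ]
    return {"text": texts}

dataset = dataset.map(format_examples, batched = True)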

Checkpoints

This repository contains multiple checkpoints from the pretraining process (a sketch for loading a specific checkpoint follows the list):

  • checkpoint-500
  • checkpoint-1000
  • checkpoint-1500
  • checkpoint-2000
  • checkpoint-2500
  • checkpoint-2530 (final)
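
To experiment with an intermediate checkpoint instead of the final weights, one option, assuming each checkpoint sits in its own subfolder of this repository, is to download just that subfolder and load it locally:

from huggingface_hub import snapshot_download
from unsloth import FastLanguageModel

# Download only the files for checkpoint-2000 (assumed repo layout).
local_dir = snapshot_download(
    repo_id = "Bronsn/gemma-2-2b-it-pretrained",
    allow_patterns = ["checkpoint-2000/*"],
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"{local_dir}/checkpoint-2000",
    max_seq_length = 2048,
    load_in_4bit = True,
)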

Usage

from unsloth import FastLanguageModel
import torch

# Load the model (final checkpoint) in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Bronsn/gemma-2-2b-it-pretrained",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable unsloth's faster inference path

# Example usage: complete a Wikipedia-style prompt in Luganda
text = "Ekyawandiikibwa kya Wikipedia\n### Omutwe: Uganda\n\n### Akawayiro:\n"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Limitations

  • The model is adapted specifically for Luganda text understanding and generation and is not intended as a general multilingual model
  • Performance may vary on dialectal variations or code-mixed text
  • The model inherits the limitations of the base Gemma-2-2b-it model

Citation

If you use this model, please cite:

@misc{luganda-gemma-pretrained,
  author = {Bronsn},
  title = {Gemma-2-2b-it Pretrained for Luganda},
  year = {2025},
  publisher = {HuggingFace}
}

License

This model inherits the licensing terms from the base Gemma-2-2b-it model. For more details, please refer to Gemma's license.