emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained

Model Details

  • Commercially available only to the BBVA Group; free to use for academic and research purposes

Model Description

This model is a continued pre-trained version of the google/gemma-2-9b base model, trained on the Turkish Wikipedia dataset (Alaeddin/wikipedia-turkish). The continued pre-training was performed with Low-Rank Adaptation (LoRA) to adapt the model to Turkish efficiently. It aims to improve the base model's understanding and generation capabilities for Turkish text.

  • Developed by: Emre Tasar, PhD candidate at the University of Navarra / Data Scientist
  • Funded by: Self-funded research project (approximately 820 Google Colab compute hours)
  • Shared by: Emre Tasar (https://huggingface.co/emre)
  • Model type: Causal Language Model
  • Language(s) (NLP): Turkish (tr)
  • License: Gemma license; commercial use restricted to the BBVA Group, free for academic and research use
  • Continued pre-trained from model: google/gemma-2-9b

Uses

Direct Use

This model can be used for generating Turkish text for various natural language processing tasks, such as:

  • Text generation
  • Language modeling
  • Creative writing
  • Answering questions based on Turkish text (with appropriate prompting)

It is intended for researchers, developers, and enthusiasts interested in exploring and utilizing large language models for the Turkish language.

Downstream Use

This model can serve as a strong base for further fine-tuning on specific downstream tasks in Turkish (a minimal fine-tuning sketch follows the list below), such as:

  • Turkish text summarization
  • Turkish question answering
  • Turkish text classification
  • Turkish dialogue generation
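
A minimal sketch of how one might continue training from this adapter on a downstream Turkish dataset. This is an illustration, not the released training script: the dataset file, column names, output directory, and hyperparameters below are placeholder assumptions, and the 4-bit setup mirrors the one described under Training Details.

import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments, Trainer, DataCollatorForLanguageModeling)
from peft import PeftModel

base_id = "google/gemma-2-9b"
adapter_id = "emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained"

tokenizer = AutoTokenizer.from_pretrained(base_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)

# Load the adapter with is_trainable=True so its LoRA weights continue to update
model = PeftModel.from_pretrained(base_model, adapter_id, is_trainable=True)

# Hypothetical downstream corpus; replace with your own task data
dataset = load_dataset("text", data_files={"train": "turkish_task_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma2-9b-tr-downstream",   # illustrative path
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        fp16=True
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()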

Out-of-Scope Use

This model should not be used for generating harmful, unethical, or biased content. As a language model trained on a large corpus of text, it may inadvertently generate such content. Users should exercise caution and responsibility when deploying this model.

Bias, Risks, and Limitations

The model was trained on the Turkish Wikipedia dataset, which may contain biases present in the original data. The model's performance may vary depending on the specific task and domain. Users should be aware of these limitations and conduct thorough evaluations for their specific use cases.

Recommendations

Users should carefully evaluate the model's output and consider potential biases before deploying it in real-world applications. Further fine-tuning on task-specific and diverse Turkish datasets can help mitigate some of these limitations.

Training Details

Training Data

The model was trained on the Alaeddin/wikipedia-turkish dataset (a loading sketch follows the split summary below):

  • Training Split: 1,620,000 paragraphs.
  • Validation Split: 1,000 paragraphs (disjoint from the training set).
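
A sketch of how such a split could be reproduced with the datasets library; the split name, the seed, and the use of train_test_split are assumptions rather than the exact procedure used.

from datasets import load_dataset

# Load the Turkish Wikipedia corpus (split name assumed to be "train")
dataset = load_dataset("Alaeddin/wikipedia-turkish", split="train")

# Hold out a small, disjoint validation set; the sizes mirror the card
splits = dataset.train_test_split(test_size=1000, seed=42)
train_dataset = splits["train"]   # ~1,620,000 training paragraphs
eval_dataset = splits["test"]     # 1,000 validation paragraphs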

Training Procedure

The model was trained using the Hugging Face Trainer API on a Google Colab Pro+ instance with an A100 GPU (40GB). Key settings are listed below, followed by a configuration sketch:

  • Quantization: 4-bit with NF4 type and double quantization (BitsAndBytesConfig).
  • LoRA Configuration:
    • Rank (r): 8
    • Alpha (lora_alpha): 32
    • Target Modules: q_proj, v_proj
    • Dropout: 0.1
  • Training Arguments:
    • Epochs: 1
    • Effective Batch Size: 8 (per_device_train_batch_size=2, gradient_accumulation_steps=4)
    • Learning Rate: 2e-5
    • Scheduler: Linear with 500 warmup steps
    • Mixed Precision: FP16
    • Evaluation Frequency: Every 5,000 steps
    • Total Steps: 202,500
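
A configuration sketch that reproduces the settings listed above (plus the fused AdamW optimizer noted under Training Hyperparameters). The output directory is illustrative; the exact training script is not part of this card.

import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# LoRA adapter: r=8, alpha=32, q_proj/v_proj, dropout 0.1
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Trainer arguments matching the listed values
training_args = TrainingArguments(
    output_dir="gemma2-9b-tr-cpt",        # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch size 8
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    fp16=True,
    eval_strategy="steps",
    eval_steps=5000,
    optim="adamw_torch_fused"
)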

Training Hyperparameters

  • Training regime: FP16 mixed precision
  • Optimizer: AdamW (fused implementation)

Speeds, Sizes, Times

  • Duration: Approximately 110 hours
  • Hardware: A100 GPU (40GB)
  • Trainable Parameters: 4,472,832 (0.0484% of total 9,246,178,816 parameters)

Evaluation

Testing Data, Factors & Metrics

Testing Data

1,000 paragraphs from the Turkish Wikipedia dataset, reserved as a validation set.

Metrics

  • Validation Loss: Measures the model's prediction error on the validation set.
  • Perplexity: Indicates how well the model predicts the next token (lower is better).

Results

| Model                         | Validation Loss | Perplexity |
|-------------------------------|-----------------|------------|
| Pre-trained base (Gemma-2-9b) | 2.5168          | 12.39      |
| Continued pre-trained (LoRA)  | 2.1027          | 8.19       |

The LoRA-adapted model significantly outperforms the base model on Turkish text.
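
Perplexity here is the exponential of the mean validation loss, so the reported values can be checked directly:

import math

base_loss, lora_loss = 2.5168, 2.1027
print(round(math.exp(base_loss), 2))  # 12.39 (pre-trained base)
print(round(math.exp(lora_loss), 2))  # 8.19 (LoRA continued pre-training)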

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator:

  • Hardware Type: A100 GPU (40GB)
  • Hours Used: 110 hours
  • Cloud Provider: Google Colab
  • Compute Region: Unknown (assumed us-central1 for estimation)
  • Carbon Emitted: ~22 kg CO2eq (based on 44 kWh at 0.5 kg CO2/kWh)

Note: Exact emissions depend on the compute region's energy mix.
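
For reference, the arithmetic behind the figure above; the ~0.4 kW average system draw is an assumption implied by 44 kWh over 110 hours:

hours = 110
avg_power_kw = 0.4      # assumed average draw of the A100 instance
grid_intensity = 0.5    # kg CO2eq per kWh (assumed energy mix)

energy_kwh = hours * avg_power_kw           # 44.0 kWh
emissions_kg = energy_kwh * grid_intensity  # 22.0 kg CO2eq
print(f"{energy_kwh:.0f} kWh -> ~{emissions_kg:.0f} kg CO2eq")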

How to Get Started with the Model

You can easily load and use this model using the transformers and peft libraries in Python:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "google/gemma-2-9b"
peft_model_id = "emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the base model with 4-bit quantization for efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="eager"
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, peft_model_id)
model.eval()

# Note: device_map="auto" has already placed the quantized model on the available GPU;
# calling .to("cuda") on a 4-bit model is unnecessary and not supported.

prompt = "Türkiye'nin başkenti neresidir?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
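
If a standalone checkpoint is preferred over the base-plus-adapter setup, the LoRA weights can be merged into an unquantized copy of the base model. A minimal sketch, assuming enough memory for the full fp16 weights; the output path is illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained")
merged = model.merge_and_unload()   # folds the LoRA deltas into the base weights

merged.save_pretrained("gemma-2-9b-turkish-merged")
AutoTokenizer.from_pretrained("google/gemma-2-9b").save_pretrained("gemma-2-9b-turkish-merged")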

Citation

@misc{tasar2025gemma2turkish,
  author = {Davut Emre Tasar},
  title = {Gemma-2-9b Turkish LoRA Continue Pre-Trained Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained}}
}