emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained
Model Details
- Commercial use is available only to BBVA Group; free to use for academic and research purposes
Model Description
This model is a continued pre-trained version of the google/gemma-2-9b base model, trained on the Turkish Wikipedia dataset (Alaeddin/wikipedia-turkish). The continued pre-training was performed with Low-Rank Adaptation (LoRA) to efficiently adapt the model to Turkish. The goal is to improve the base model's understanding and generation capabilities for Turkish text.
- Developed by: Emre Tasar, PhD candidate, University of Navarra / Data Scientist
- Funded by: Self-funded research project (820 Google Colab compute hours)
- Shared by: Emre Tasar (https://huggingface.co/emre)
- Model type: Causal Language Model
- Language(s) (NLP): Turkish (tr)
- License: Gemma license terms; commercial use restricted to BBVA Group, free for academic use
- Continued pre-trained from model: google/gemma-2-9b
Model Sources
- Repository: https://huggingface.co/emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained
Uses
Direct Use
This model can be used for generating Turkish text for various natural language processing tasks, such as:
- Text generation
- Language modeling
- Creative writing
- Answering questions based on Turkish text (with appropriate prompting)
It is intended for researchers, developers, and enthusiasts interested in exploring and utilizing large language models for the Turkish language.
Downstream Use
This model can serve as a strong base for further fine-tuning on specific downstream tasks in Turkish, such as:
- Turkish text summarization
- Turkish question answering
- Turkish text classification
- Turkish dialogue generation
Out-of-Scope Use
This model should not be used for generating harmful, unethical, or biased content. As a language model trained on a large corpus of text, it may inadvertently generate such content. Users should exercise caution and responsibility when deploying this model.
Bias, Risks, and Limitations
The model was trained on the Turkish Wikipedia dataset, which may contain biases present in the original data. The model's performance may vary depending on the specific task and domain. Users should be aware of these limitations and conduct thorough evaluations for their specific use cases.
Recommendations
Users should carefully evaluate the model's output and consider potential biases before deploying it in real-world applications. Further fine-tuning on task-specific and diverse Turkish datasets can help mitigate some of these limitations.
Training Details
Training Data
The model was trained on the Alaeddin/wikipedia-turkish dataset; the split sizes are listed below, followed by a loading sketch:
- Training Split: 1,620,000 paragraphs.
- Validation Split: 1,000 paragraphs (disjoint from the training set).
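As a rough illustration, the splits could be reconstructed along the following lines. This is a sketch only: it assumes the dataset exposes a single train split, and the exact rows and random seed behind the published splits are not documented.

```python
from datasets import load_dataset

# Sketch only: assumes a single "train" split; the exact rows and seed behind the
# 1,620,000-paragraph training split and 1,000-paragraph validation split are not published.
ds = load_dataset("Alaeddin/wikipedia-turkish", split="train")
split = ds.train_test_split(test_size=1000, seed=42)  # seed is an assumption
train_ds, val_ds = split["train"], split["test"]
```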
Training Procedure
The model was trained using the Hugging Face Trainer API on a Google Colab Pro+ instance with an A100 GPU (40GB). Key settings are listed below; a configuration sketch follows the list.
- Quantization: 4-bit NF4 with double quantization (BitsAndBytesConfig)
- LoRA Configuration:
  - Rank (r): 8
  - Alpha (lora_alpha): 32
  - Target Modules: q_proj, v_proj
  - Dropout: 0.1
- Training Arguments:
  - Epochs: 1
  - Effective Batch Size: 8 (per_device_train_batch_size=2, gradient_accumulation_steps=4)
  - Learning Rate: 2e-5
  - Scheduler: Linear with 500 warmup steps
  - Mixed Precision: FP16
  - Evaluation Frequency: Every 5,000 steps
  - Total Steps: 202,500
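The training script itself is not published; the following is a minimal sketch of how the settings above (plus the fused AdamW optimizer listed under Training Hyperparameters) map onto the transformers and peft APIs. Anything not stated above, such as the output directory and dataset preparation, is an assumption.

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="eager",  # recommended for Gemma-2
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA configuration: r=8, alpha=32, q_proj/v_proj, dropout 0.1.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

training_args = TrainingArguments(
    output_dir="gemma-2-9b-turkish-lora",  # assumed output path
    num_train_epochs=1,                    # 1,620,000 examples / effective batch 8 -> 202,500 steps
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,         # effective batch size 8
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    fp16=True,
    eval_strategy="steps",
    eval_steps=5000,
    optim="adamw_torch_fused",
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=tokenized_train, eval_dataset=tokenized_val)
# trainer.train()
```

Tokenization and data collation are omitted; tokenized_train and tokenized_val stand in for the prepared splits.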
Training Hyperparameters
- Training regime: FP16 mixed precision
- Optimizer: AdamW (fused implementation)
Speeds, Sizes, Times
- Duration: Approximately 110 hours
- Hardware: A100 GPU (40GB)
- Trainable Parameters: 4,472,832 (0.0484% of total 9,246,178,816 parameters)
Evaluation
Testing Data, Factors & Metrics
Testing Data
1,000 paragraphs from the Turkish Wikipedia dataset, reserved as a validation set.
Metrics
- Validation Loss: Measures the model's prediction error on the validation set.
- Perplexity: Indicates how well the model predicts the next token (lower is better).
Results
| Model | Validation Loss | Perplexity |
|---|---|---|
| Base model (google/gemma-2-9b) | 2.5168 | 12.39 |
| Continued pre-trained (LoRA) | 2.1027 | 8.19 |
The LoRA-adapted model significantly outperforms the base model on Turkish text.
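For reference, the perplexity values above are simply the exponential of the corresponding validation (cross-entropy) losses:

```python
import math

# Perplexity = exp(mean cross-entropy loss)
print(round(math.exp(2.5168), 2))  # 12.39 (base model)
print(round(math.exp(2.1027), 2))  # 8.19  (LoRA continued pre-training)
```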
Environmental Impact
Carbon emissions were estimated using the Machine Learning Impact calculator:
- Hardware Type: A100 GPU (40GB)
- Hours Used: 110 hours
- Cloud Provider: Google Colab
- Compute Region: Unknown (assumed us-central1 for estimation)
- Carbon Emitted: ~22 kg CO2eq (based on 44 kWh at 0.5 kg CO2/kWh)
Note: Exact emissions depend on the compute region's energy mix.
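The figure above can be reproduced roughly as follows; the ~0.4 kW average draw is an assumption consistent with an A100's power envelope, not a measured value:

```python
# Rough reconstruction of the estimate above; 0.4 kW average GPU draw is assumed.
hours = 110
avg_power_kw = 0.4
energy_kwh = hours * avg_power_kw   # 44 kWh
co2_kg = energy_kwh * 0.5           # 0.5 kg CO2eq/kWh grid intensity -> ~22 kg
print(f"{energy_kwh} kWh, ~{co2_kg} kg CO2eq")
```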
How to Get Started with the Model
You can easily load and use this model with the transformers and peft libraries in Python:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
model_name = "google/gemma-2-9b"
peft_model_id = "emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the base model with 4-bit quantization for efficiency
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16
)
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.float16,
attn_implementation="eager"
)
# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, peft_model_id)
model.eval()  # device_map="auto" has already placed the quantized model on the available GPU
prompt = "Türkiye'nin başkenti neresidir?"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**input_ids, max_new_tokens=50, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
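If you need a standalone checkpoint (for example, to serve the model without peft), the adapter can be merged into a full-precision copy of the base model. This is a sketch, assuming enough memory to load the base in FP16; the output path is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in FP16 rather than 4-bit so the merge is lossless.
base_fp16 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b", torch_dtype=torch.float16, device_map="auto"
)
merged = PeftModel.from_pretrained(base_fp16, "emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained")
merged = merged.merge_and_unload()                   # fold the LoRA weights into the base
merged.save_pretrained("gemma-2-9b-turkish-merged")  # illustrative output path
```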
Citation
@misc{tasar2025gemma2turkish,
author = {Davut Emre Tasar},
title = {Gemma-2-9b Turkish LoRA Continue Pre-Trained Model},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/emre/gemma-2-9b-Turkish-Lora-Continue-Pre-Trained}}
}