CengBot - Çukurova University Computer Engineering Assistant
Author: naholav
Date: August 2025
A bilingual (Turkish/English) AI assistant specifically trained for Çukurova University Computer Engineering Department students using LLaMA 3.2 3B with LoRA fine-tuning.
Model Details
Developed by: naholav
Model type: Causal Language Model (LLaMA 3.2 3B + LoRA)
Language(s): Turkish, English
License: Apache-2.0
Finetuned from: meta-llama/Llama-3.2-3B
Library: PEFT 0.15.2
Dataset: Naholav/cukurova_university_chatbot
Project Links
- GitHub Repository: https://github.com/naholav/cengbot
- Model on HuggingFace: https://huggingface.co/Naholav/cengbot-lora-tr-en-cukurova
- Dataset on HuggingFace: https://huggingface.co/datasets/Naholav/cukurova_university_chatbot
Advanced Inference Script
We provide a carefully optimized standalone inference script (cengbot_inference.py) that incorporates extensive prompt engineering and parameter tuning developed through iterative testing.
Script Features
Dual Prompt Engineering:
- Turkish Prompt: Specifically crafted system prompt with strict response guidelines (max 4 sentences)
- English Prompt: Parallel optimized prompt maintaining consistency across languages
- Language Detection: Multi-layer detection using Turkish characters, common words, and langdetect library fallback
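For reference, a minimal sketch of this multi-layer detection logic (the character and word lists here are illustrative; cengbot_inference.py may implement the layers differently):

import re
from langdetect import detect

TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")
TURKISH_WORDS = {"merhaba", "nasıl", "nedir", "hangi", "ders", "için"}  # illustrative list

def detect_language(text: str) -> str:
    # Layer 1: Turkish-specific characters are a strong signal
    if any(ch in TURKISH_CHARS for ch in text):
        return "tr"
    # Layer 2: common Turkish words
    if TURKISH_WORDS & set(re.findall(r"\w+", text.lower())):
        return "tr"
    # Layer 3: fall back to the langdetect library
    try:
        return "tr" if detect(text) == "tr" else "en"
    except Exception:  # langdetect raises on empty or ambiguous input
        return "en"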
Optimized Generation Parameters:
# Carefully tuned through extensive testing
temperature = 0.7 # Balanced creativity vs consistency
max_new_tokens = 200 # Sufficient for detailed answers
repetition_penalty = 1.1 # Prevents repetitive responses
top_p = 0.95 # Nucleus sampling for quality
top_k = 50 # Top-k filtering for coherence
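As a rough sketch, these values map onto a standard Hugging Face generate() call like the one below (the actual call in cengbot_inference.py may include additional arguments):

outputs = model.generate(
    **inputs,
    do_sample=True,              # sampling must be enabled for temperature/top_p/top_k to apply
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.1,
    max_new_tokens=200,
    pad_token_id=tokenizer.eos_token_id,
)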
Smart Response Control:
- Greeting Detection: Automatically reduces token usage for simple greetings (30 tokens vs 100)
- Response Limiting: Enforces maximum 4 sentences with intelligent sentence splitting
- Question Limiting: Restricts model to ask maximum 1 question per response
- Language-Specific Processing: Different response extraction logic for Turkish vs English
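A minimal sketch of this response control logic (the greeting list and token budgets are illustrative; see cengbot_inference.py for the actual implementation):

import re

GREETINGS = {"merhaba", "selam", "hello", "hi", "hey"}  # illustrative list

def pick_token_budget(question: str) -> int:
    # Simple greetings get a much smaller generation budget than real questions
    return 30 if question.strip().lower() in GREETINGS else 100

def limit_sentences(text: str, max_sentences: int = 4) -> str:
    # Split on sentence-ending punctuation and keep at most max_sentences
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])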
Prompt Engineering Process: The system prompts were developed through extensive trial-and-error testing to:
- Maintain university-specific context
- Ensure concise but helpful responses
- Balance friendliness with professionalism
- Optimize for educational support scenarios
- Handle both casual questions and technical inquiries
System Prompt Examples:
Turkish System Prompt (Optimized):
Sen Çukurova Üniversitesi Bilgisayar Mühendisliği bölümünün deneyimli dijital asistanısın.
Öğrencilere samimi, yardımsever ve doğru bilgiler vererek destek oluyorsun.
Önemli kurallar:
- Her soruya MAKSİMUM 4 CÜMLE ile yanıt ver
- Cevapların çok kısa, net ve anlaşılır olmalı
- Gereksiz açıklamalardan ve tekrarlardan kesinlikle kaçın
- Selamlama mesajlarına tek cümlelik karşılık ver
- Sadece sorulan soruya odaklan, ekstra bilgi verme
- Cevaplarında MAKSİMUM 1 adet soru sorabilirsin
English System Prompt (Parallel):
You are an experienced digital assistant for Çukurova University Computer Engineering Department.
You help students by providing friendly, helpful and accurate information.
Important rules:
- Answer each question with MAXIMUM 4 SENTENCES
- Keep your answers very short, clear and understandable
- Absolutely avoid unnecessary explanations and repetitions
- Reply to greeting messages with a single sentence
- Focus only on the asked question, don't provide extra information
- You can ask MAXIMUM 1 question in your answers
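At inference time the detected language selects one of these system prompts, which is then combined with the user question. A minimal sketch is shown below; the Soru/Cevap and Question/Answer labels are assumptions for illustration, and the exact template lives in cengbot_inference.py:

TR_SYSTEM = "..."  # full Turkish system prompt shown above
EN_SYSTEM = "..."  # full English system prompt shown above

def build_prompt(question: str, lang: str) -> str:
    # Language-specific prompt assembly; labels here are placeholders
    if lang == "tr":
        return f"{TR_SYSTEM}\n\nSoru: {question}\nCevap:"
    return f"{EN_SYSTEM}\n\nQuestion: {question}\nAnswer:"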
How to Get Started
Direct Use:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Naholav/cengbot-lora-tr-en-cukurova")
# Generate response
inputs = tokenizer("Merhaba, bilgisayar mühendisliği hakkında bilgi verebilir misin?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
Simple Setup:
- Install dependencies:
pip install torch transformers peft langdetect
- Download and run:
wget https://raw.githubusercontent.com/naholav/cengbot/main/cengbot_inference.py
python cengbot_inference.py
That's it! The model will be downloaded automatically from HuggingFace.
Features
- Dual Language Support: Automatically detects Turkish or English input using langdetect
- Dual Prompt System: Separate optimized prompts for Turkish and English responses
- Dual Language Validation: Separate TR/EN validation during training (TR Loss: 0.6750, EN Loss: 0.7387)
- University-Specific: Trained on Çukurova University Computer Engineering data
- Concise Responses: Maximum 4 sentences per response with smart length control
- Smart Greeting Detection: Optimized token usage for greetings vs detailed questions
- Language Detection: Advanced Turkish character and word detection fallbacks
- Easy Setup: Just copy and run the inference script
Model Details
Architecture:
- Base Model: meta-llama/Llama-3.2-3B (3.2B parameters, 28 layers, 3072 hidden size)
- Model Type: LLaMA (torch.bfloat16)
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Config: r=16, alpha=32, dropout=0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
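For reference, this setup corresponds to a PEFT LoraConfig roughly like the following (a sketch; the training script in the GitHub repository is authoritative):

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                  # LoRA rank
    lora_alpha=32,         # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)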
Training Specifications:
- Languages: Turkish, English (near-balanced bilingual training)
- Domain: Computer Engineering Education (Çukurova University specific)
- Response Style: Concise, helpful, friendly (maximum 4 sentences)
- Prompt Engineering: Dual system prompts with language-specific optimization
- Language Detection: Advanced Turkish character + word detection with langdetect fallback
- Generation Parameters: temp=0.7, top_p=0.95, top_k=50, rep_penalty=1.1
Hardware & Performance:
- Training Hardware: RTX 5090 32GB (90% utilization), 128GB RAM, AMD Ryzen 9 9950X
- Best Validation: 0.6920 overall, 0.6636 TR, 0.7268 EN
- Inference Speed: Optimized for real-time chat with greeting detection
System Requirements
For Inference:
- GPU: CUDA-compatible (8GB+ VRAM recommended)
- RAM: 8GB minimum, 16GB recommended
- Python: 3.8+
- Internet: For initial model download
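On a machine that meets these requirements, the adapter can be loaded in bfloat16 on the GPU roughly as follows (a sketch; cengbot_inference.py handles model loading for you, and device_map="auto" needs the accelerate package):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,   # roughly 6.5 GB of VRAM for the 3B model
    device_map="auto",            # requires the accelerate package
)
model = PeftModel.from_pretrained(base_model, "Naholav/cengbot-lora-tr-en-cukurova")
model.eval()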
Usage Examples
Turkish:
You: Merhaba, bilgisayar mühendisliği hakkında bilgi verebilir misin?
CengBot: Merhaba! Bilgisayar mühendisliği, yazılım geliştirme, sistem tasarımı ve teknoloji çözümleri üzerine odaklanan bir disiplindir. Çukurova Üniversitesi'nde güçlü bir müfredatımız var. Hangi konuda detay istiyorsun?
English:
You: What programming languages should I learn first?
CengBot: I recommend starting with Python for its simplicity and versatility, then moving to Java or C++ for fundamental concepts. These languages provide a solid foundation for computer engineering. Which area interests you most - web development, systems programming, or data science?
Training Performance
Dataset Information:
- Total Examples: 22,524 (11,187 Turkish, 11,337 English)
- Training Set: 20,948 examples
- Validation Sets: 783 Turkish, 793 English (7% split each)
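The dataset itself can be pulled straight from the Hub, for example as below (split and column names follow the dataset card, so inspect the object before use):

from datasets import load_dataset

dataset = load_dataset("Naholav/cukurova_university_chatbot")
print(dataset)  # inspect available splits and columns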
Training Progress:
- Total Steps: 1170/1312 (Early stopping activated at step 1170)
- Best Eval Loss: 0.6920 (Step 975); model saved from best checkpoint
- Final Train Loss: 0.5104 (continuous improvement)
- Early Stopping: Patience=3, triggered due to eval loss increase
- Training Time: ~3 hours on RTX 5090
Best Model Performance (Step 975):
- Overall Eval Loss: 0.6920
- Turkish Validation Loss: 0.6636
- English Validation Loss: 0.7268
- Average Loss: 0.6952
Training Features:
- Dual Language Evaluation: Separate TR/EN validation tracking every step
- Early Stopping: Patience=3 prevents overfitting, saves best model
- Advanced Optimization: Mixed precision (BF16), gradient checkpointing
- LoRA Configuration: r=16, alpha=32, dropout=0.1, targets all linear layers
- Hardware Optimization: RTX 5090 32GB (90% VRAM usage), 128GB RAM, 16-core CPU
- Memory Management: Pre-cached datasets in RAM, persistent workers
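A sketch of how the early stopping and mixed-precision settings above map onto the Hugging Face Trainer API; values not listed in this card (output paths, step intervals, learning rate) are omitted or illustrative:

from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="cengbot-lora",
    bf16=True,                       # mixed precision (BF16)
    gradient_checkpointing=True,
    eval_strategy="steps",           # "evaluation_strategy" on older transformers versions
    save_strategy="steps",
    load_best_model_at_end=True,     # keep the best checkpoint (step 975 here)
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)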
Prompt Engineering & Data Processing:
- Dual Template System: Separate training templates for Turkish and English
- Response Masking: Only LoRA adapter learns from assistant responses, not user questions
- Template Consistency: Identical structure between training and inference prompts
- Balanced Dataset: Near-balanced Turkish (11,187) and English (11,337) examples
- Response Quality Control: Training data filtered for concise, educational responses
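A minimal sketch of the response-masking idea: label tokens that belong to the question are set to -100 so the loss is computed only on the assistant response (tokenization details here are simplified relative to the actual training script):

def mask_question_tokens(tokenizer, prompt: str, response: str) -> dict:
    # Tokenize the prompt alone to find where the response begins
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(prompt + response, add_special_tokens=False)["input_ids"]
    labels = list(full_ids)
    labels[:len(prompt_ids)] = [-100] * len(prompt_ids)  # -100 is ignored by the loss
    return {"input_ids": full_ids, "labels": labels}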
Limitations and Bias
- Domain Specific: Optimized for computer engineering education
- Response Length: Limited to maximum 4 sentences for conciseness
- University Context: Trained specifically on Çukurova University data
- Language Balance: Balanced TR/EN training but may show slight performance differences
Technical Specifications
Hardware: RTX 5090 32GB, 128GB RAM, AMD Ryzen 9 9950X
Training Time: ~3 hours
Framework: PEFT 0.15.2, Transformers, PyTorch
Optimization: Mixed precision (BF16), gradient checkpointing
Environmental Impact
Hardware Type: NVIDIA RTX 5090
Hours Used: ~3 hours
Compute Region: Local training setup
Carbon Emissions: Low; roughly 3 GPU-hours of LoRA fine-tuning on a single consumer GPU
Contributing
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
Contact
- Issues: https://github.com/naholav/cengbot/issues
- Author: naholav
Citation
@misc{cengbot2025,
title={CengBot: A Bilingual AI Assistant for Computer Engineering Education},
author={naholav},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/Naholav/cengbot-lora-tr-en-cukurova}
}
Happy learning!