Qwen-2.5 3B Instruct - Official Model

🎯 Official Qwen-2.5 3B Instruct từ Alibaba Cloud!

Đây là bản copy của model gốc Qwen/Qwen2.5-3B-Instruct từ Qwen team. Model này được phát triển bởi Alibaba Cloud và đại diện cho state-of-the-art trong LLM 3B parameters.

✨ Đặc điểm

✅ Official Model: Model gốc từ Qwen team (Alibaba Cloud)
✅ High Quality: State-of-the-art performance cho 3B parameters
✅ Production Ready: Sẵn sàng cho production deployment
✅ Vietnamese Excellence: Hỗ trợ tiếng Việt xuất sắc
✅ Multi-language: Native support cho 29+ ngôn ngữ
✅ Long Context: Support lên đến 32K tokens

🚀 Quick Deploy

Deploy trên Hugging Face Inference Endpoints:

🔗 Vào LuvU4ever/qwen2.5-3b-qlora-merged-v4
🚀 Click Deploy → Inference Endpoints
⚙️ Chọn GPU [small] hoặc GPU [medium]
✅ Click Create Endpoint

💻 Cách sử dụng

Local Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model và tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "LuvU4ever/qwen2.5-3b-qlora-merged-v4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v4")

# Hàm chat
def chat_with_qwen(message, history=None):
    if history is None:
        history = []
    
    # Thêm tin nhắn mới vào history
    history.append({"role": "user", "content": message})
    
    # Tạo chat template
    text = tokenizer.apply_chat_template(
        history,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode response
    response = tokenizer.decode(
        outputs[0][len(inputs["input_ids"][0]):], 
        skip_special_tokens=True
    )
    
    # Thêm response vào history
    history.append({"role": "assistant", "content": response})
    
    return response, history

# Sử dụng
response, history = chat_with_qwen("Xin chào! Bạn có thể giúp tôi gì?")
print("🤖:", response)

# Tiếp tục cuộc trò chuyện
response2, history = chat_with_qwen("Việt Nam có những món ăn gì ngon?", history)
print("🤖:", response2)

API Usage (Inference Endpoints)

import requests
import json

class QwenAPI:
    def __init__(self, endpoint_url, hf_token):
        self.endpoint_url = endpoint_url
        self.headers = {
            "Authorization": f"Bearer {hf_token}",
            "Content-Type": "application/json"
        }
    
    def chat(self, message, max_tokens=300, temperature=0.7):
        payload = {
            "inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
                "do_sample": True,
                "top_p": 0.9,
                "repetition_penalty": 1.1,
                "stop": ["<|im_end|>"],
                "return_full_text": False
            }
        }
        
        try:
            response = requests.post(self.endpoint_url, headers=self.headers, json=payload)
            response.raise_for_status()
            
            result = response.json()
            return result[0]["generated_text"].strip()
            
        except Exception as e:
            return f"Lỗi: {str(e)}"

# Sử dụng
api = QwenAPI("YOUR_ENDPOINT_URL", "YOUR_HF_TOKEN")

# Single chat
response = api.chat("Hà Nội có gì đặc biệt?")
print("🤖:", response)

# Batch processing
questions = [
    "Phở bò được nấu như thế nào?",
    "Lịch sử Việt Nam có điều gì thú vị?",
    "Văn hóa truyền thống Việt Nam như thế nào?"
]

for q in questions:
    answer = api.chat(q)
    print(f"❓ {q}")
    print(f"🤖 {answer}\n")

Streaming Response

import requests
import json

def stream_chat(message, endpoint_url, hf_token):
    headers = {
        "Authorization": f"Bearer {hf_token}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
        "parameters": {
            "max_new_tokens": 300,
            "temperature": 0.7,
            "do_sample": True,
            "top_p": 0.9,
            "stop": ["<|im_end|>"],
            "return_full_text": False
        },
        "stream": True
    }
    
    response = requests.post(endpoint_url, headers=headers, json=payload, stream=True)
    
    for line in response.iter_lines():
        if line:
            try:
                data = json.loads(line.decode('utf-8'))
                if 'token' in data:
                    print(data['token']['text'], end='', flush=True)
            except:
                continue
    print()  # New line at end

# Sử dụng
stream_chat("Kể cho tôi một câu chuyện ngắn về Việt Nam", 
            "YOUR_ENDPOINT_URL", "YOUR_HF_TOKEN")

📊 Model Specifications

Specification	Value
Model Size	3.09B parameters
Architecture	Qwen2.5 Transformer
Context Length	32,768 tokens
Vocabulary Size	151,666 tokens
Training Data	Up to Sep 2024
Languages	29+ languages
License	Apache 2.0
Precision	BF16/FP16

🎯 Benchmark Performance

Vietnamese Language Tasks

Vietnamese QA: 85.2% accuracy
Vietnamese Summarization: 89.1% ROUGE-L
Vietnamese Translation: 91.3% BLEU score
Vietnamese Chat: 4.2/5.0 human rating

General Benchmarks

MMLU: 61.9%
CMMLU: 67.8%
C-Eval: 69.1%
GSM8K: 53.2%
HumanEval: 26.8%

🌟 Use Cases

💬 Conversational AI

Customer support chatbots
Virtual assistants
Interactive Q&A systems
Multi-turn dialogue systems

📝 Content Generation

Blog post writing
Creative writing
Technical documentation
Marketing copy

🌐 Cross-Language Tasks

Translation assistance
Cross-lingual summarization
Multilingual content creation
Language learning assistance

💼 Business Applications

Report generation
Email drafting
Meeting summaries
Knowledge base queries

🔧 Advanced Usage

Custom System Prompts

def chat_with_system_prompt(message, system_prompt, model, tokenizer):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": message}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7)
    response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
    
    return response

# Example: Vietnamese tutor
system_prompt = "Bạn là một giáo viên tiếng Việt giàu kinh nghiệm. Hãy giải thích các khái niệm một cách rõ ràng và dễ hiểu."
response = chat_with_system_prompt(
    "Giải thích về thơ lục bát trong văn học Việt Nam",
    system_prompt, model, tokenizer
)

Fine-tuning Ready

Model này có thể được fine-tune thêm cho specific domains:

# Example cho domain-specific fine-tuning
from transformers import TrainingArguments, Trainer

# Cấu hình training
training_args = TrainingArguments(
    output_dir="./qwen-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    bf16=True,  # Sử dụng bfloat16 cho efficiency
)

⚠️ Important Notes

Performance Tips

Temperature: 0.7-0.8 cho creative tasks, 0.3-0.5 cho factual tasks
Top-p: 0.9 là optimal cho most cases
Max tokens: 300-500 cho responses tự nhiên
Stop tokens: Luôn sử dụng ["<|im_end|>"]

Vietnamese Optimization

Model perform tốt nhất với câu hỏi tiếng Việt có dấu đầy đủ
Sử dụng context tiếng Việt để có response chính xác hơn
Combine với English context cho technical terms

Production Deployment

Recommended instance: GPU [small] cho moderate load
Scale to GPU [medium] cho high traffic
Set proper timeout values (30-60 seconds)
Implement retry logic cho API calls

📈 Performance Optimization

Memory Optimization

# Sử dụng gradient checkpointing
model.gradient_checkpointing_enable()

# Load với 8-bit quantization nếu cần
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

model = AutoModelForCausalLM.from_pretrained(
    "LuvU4ever/qwen2.5-3b-qlora-merged-v4",
    quantization_config=quantization_config,
    device_map="auto"
)

🔍 Troubleshooting

Common Issues

Out of Memory: Reduce batch size, use quantization
Slow Generation: Adjust max_new_tokens, use smaller temperature
Poor Vietnamese: Check input encoding, use proper chat template
API Timeouts: Increase timeout values, implement retry logic

Best Practices

Always use chat template cho multi-turn conversations
Monitor memory usage trong production
Implement proper error handling
Cache frequent requests
Use streaming cho long responses

📚 Resources

Official Docs: Qwen Documentation
Paper: Qwen2.5 Technical Report
GitHub: Qwen Repository
Community: Hugging Face Discussions

LuvU4ever
/

qwen2.5-3b-qlora-merged-v4