Qwen2.5 3B Instruct - Production Ready

πŸš€ Verified working with Hugging Face Inference Endpoints!

This is a copy of unsloth/Qwen2.5-3B-Instruct optimized for production deployment. The model has been tested and verified to work with HF Inference Endpoints.

✨ Features

  • βœ… Inference Endpoints Ready: Verified working with HF Inference Endpoints
  • βœ… No Quantization Issues: No quantization problems with TGI
  • βœ… Production Optimized: Ready for production environments
  • βœ… Vietnamese Excellence: Excellent Vietnamese-language support
  • βœ… Multi-language: Supports 29+ languages
  • βœ… High Performance: Strong performance from only 3B parameters

πŸš€ Quick Deploy

One-click deploy on Inference Endpoints:

  1. πŸ”— Go to LuvU4ever/qwen2.5-3b-qlora-merged-v2
  2. πŸš€ Click Deploy β†’ Inference Endpoints
  3. βš™οΈ Choose a GPU [small] instance
  4. βœ… Click Create Endpoint (or script it, as sketched below)
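
If you prefer to script the deployment, the same endpoint can be created with huggingface_hub. A minimal sketch, assuming you are logged in with a token that has Endpoints access; the endpoint name and the instance_type/instance_size values are assumptions, so check the Inference Endpoints UI for the options available to your account:

from huggingface_hub import create_inference_endpoint

# Hypothetical endpoint name; instance values below are assumptions,
# verify the GPU tiers offered for your account before running this.
endpoint = create_inference_endpoint(
    "qwen25-3b-chat",
    repository="LuvU4ever/qwen2.5-3b-qlora-merged-v2",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",           # assumption: the "GPU [small]" tier
    instance_type="nvidia-a10g",  # assumption
)

endpoint.wait()      # block until the endpoint is running
print(endpoint.url)  # use this as API_URL in the examples below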

πŸ’» Usage

Local Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "LuvU4ever/qwen2.5-3b-qlora-merged-v2",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v2")

# Chat with the model
messages = [
    {"role": "user", "content": "Xin chΓ o! BαΊ‘n cΓ³ thể giΓΊp tΓ΄i gΓ¬?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)

API Usage (Inference Endpoints)

import requests

# API configuration
API_URL = "YOUR_ENDPOINT_URL"  # copy this from the Inference Endpoints dashboard
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN",
    "Content-Type": "application/json"
}

def chat_with_model(message, max_tokens=200):
    payload = {
        "inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "do_sample": True,
            "stop": ["<|im_end|>"],
            "return_full_text": False
        }
    }
    
    response = requests.post(API_URL, headers=headers, json=payload)
    
    if response.status_code == 200:
        result = response.json()
        return result[0]["generated_text"].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"

# Usage
response = chat_with_model("Việt Nam cΓ³ nhα»―ng mΓ³n Δƒn truyền thα»‘ng nΓ o?")
print(response)
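
As an alternative to building the JSON payload by hand, huggingface_hub's InferenceClient talks to the same TGI endpoint and also supports token streaming. A minimal sketch reusing the API_URL and token placeholders above:

from huggingface_hub import InferenceClient

client = InferenceClient(model=API_URL, token="YOUR_HF_TOKEN")

prompt = "<|im_start|>user\nXin chΓ o!<|im_end|>\n<|im_start|>assistant\n"

# Same parameters as the raw JSON payload above
answer = client.text_generation(
    prompt,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
    stop_sequences=["<|im_end|>"],
)
print(answer)

# Streaming: print tokens as they are generated
for token in client.text_generation(
    prompt, max_new_tokens=200, stream=True, stop_sequences=["<|im_end|>"]
):
    print(token, end="", flush=True)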

Batch Processing

def batch_chat(messages_list):
    results = []
    for msg in messages_list:
        response = chat_with_model(msg)
        results.append({"question": msg, "answer": response})
    return results

# Example
questions = [
    "HΓ  Nα»™i cΓ³ gΓ¬ Δ‘αΊ·c biệt?",
    "CΓ‘ch nαΊ₯u phở bΓ²?", 
    "Lα»‹ch sα»­ Việt Nam cΓ³ gΓ¬ thΓΊ vα»‹?"
]

results = batch_chat(questions)
for item in results:
    print(f"Q: {item['question']}")
    print(f"A: {item['answer']}\n")

πŸ“Š Specifications

Spec             Value
----             -----
Model Size       ~3B parameters
Architecture     Qwen2.5
Context Length   32,768 tokens
Languages        29+ languages
Deployment       βœ… HF Inference Endpoints
Format           Safetensors
License          Apache 2.0

🎯 Use Cases

  • πŸ’¬ Chatbots: Customer service, virtual assistants
  • πŸ“ Content Generation: Blog posts, articles, creative writing
  • πŸ” Q&A Systems: Knowledge bases, FAQ automation
  • 🌐 Multi-language: Translation and cross-language tasks
  • πŸ’Ό Business: Report generation, email drafting
  • πŸŽ“ Education: Tutoring, explanation generation

πŸ”§ Chat Format

The model uses the Qwen (ChatML) chat template:

<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
AI response here<|im_end|>
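
You normally do not need to assemble this string by hand: tokenizer.apply_chat_template produces this layout from a list of messages (note that the Qwen2.5 template also prepends a default system turn). A quick check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v2")

messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # ends with "<|im_start|>assistant\n", ready for generation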

⚠️ Important Notes

  • The model performs best with temperature 0.7-0.8 (see the GenerationConfig sketch after this list)
  • Use the stop token ["<|im_end|>"] to avoid over-generation
  • The model gives very natural answers to Vietnamese questions
  • Compatibility with the TGI container has been verified
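
These settings can be baked in as defaults so every model.generate() call uses them. A minimal sketch reusing the model and tokenizer from the Local Inference example:

from transformers import GenerationConfig

model.generation_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.7,  # recommended range: 0.7-0.8
    do_sample=True,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),  # stop at end of turn
    pad_token_id=tokenizer.eos_token_id,
)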

πŸ† Performance

  • βœ… Inference Endpoints: Tested and verified working
  • ⚑ Speed: ~20-50 tokens/second on a GPU [small] instance
  • 🎯 Accuracy: Excellent for Vietnamese and English
  • πŸ’Ύ Memory: ~6 GB VRAM for inference

πŸ“ž Support

  • πŸ› Issues: Report tαΊ‘i GitHub issues
  • πŸ“š Docs: Xem Qwen2.5 documentation
  • πŸ’¬ Community: HuggingFace discussions

πŸŽ‰ Ready for production deployment!
