---
base_model: unsloth/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- instruct
- unsloth
- vietnamese
- inference-ready
- production-ready
language:
- en
- zh
- vi
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# Qwen-2.5 3B Instruct - Production Ready

🚀 **Verified working với Hugging Face Inference Endpoints!**

Đây là copy của `unsloth/Qwen2.5-3B-Instruct` được optimize cho production deployment. Model này đã được test và verified hoạt động hoàn hảo với HF Inference Endpoints.

## ✨ Đặc điểm

- ✅ **Inference Endpoints Ready**: Verified hoạt động 100% với HF Inference Endpoints
- ✅ **No Quantization Issues**: Không có vấn đề quantization với TGI
- ✅ **Production Optimized**: Sẵn sàng cho production environment
- ✅ **Vietnamese Excellence**: Hỗ trợ tiếng Việt xuất sắc
- ✅ **Multi-language**: Hỗ trợ 29+ ngôn ngữ
- ✅ **High Performance**: 3B parameters với hiệu suất cao

## 🚀 Quick Deploy

**1-Click Deploy trên Inference Endpoints:**

1. 🔗 Vào [LuvU4ever/qwen2.5-3b-qlora-merged-v3](https://huggingface.co/LuvU4ever/qwen2.5-3b-qlora-merged-v3)
2. 🚀 Click **Deploy** → **Inference Endpoints**  
3. ⚙️ Chọn **GPU [small]** instance
4. ✅ Click **Create Endpoint**

## 💻 Cách sử dụng

### Local Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "LuvU4ever/qwen2.5-3b-qlora-merged-v3",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v3")

# Chat với model
messages = [
    {"role": "user", "content": "Xin chào! Bạn có thể giúp tôi gì?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
```

### API Usage (Inference Endpoints)

```python
import requests
import json

# Cấu hình API
API_URL = "YOUR_ENDPOINT_URL"  # Lấy từ Inference Endpoints
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN",
    "Content-Type": "application/json"
}

def chat_with_model(message, max_tokens=200):
    payload = {
        "inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "do_sample": True,
            "stop": ["<|im_end|>"],
            "return_full_text": False
        }
    }
    
    response = requests.post(API_URL, headers=headers, json=payload)
    
    if response.status_code == 200:
        result = response.json()
        return result[0]["generated_text"].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"

# Sử dụng
response = chat_with_model("Việt Nam có những món ăn truyền thống nào?")
print(response)
```

### Batch Processing

```python
def batch_chat(messages_list):
    results = []
    for msg in messages_list:
        response = chat_with_model(msg)
        results.append({"question": msg, "answer": response})
    return results

# Example
questions = [
    "Hà Nội có gì đặc biệt?",
    "Cách nấu phở bò?", 
    "Lịch sử Việt Nam có gì thú vị?"
]

results = batch_chat(questions)
for item in results:
    print(f"Q: {item['question']}")
    print(f"A: {item['answer']}\n")
```

## 📊 Specifications

| Spec | Value |
|------|-------|
| Model Size | ~3B parameters |
| Architecture | Qwen2.5 |
| Context Length | 32,768 tokens |
| Languages | 29+ languages |
| Deployment | ✅ HF Inference Endpoints |
| Format | Safetensors |
| License | Apache 2.0 |

## 🎯 Use Cases

- 💬 **Chatbots**: Customer service, virtual assistants
- 📝 **Content Generation**: Blog posts, articles, creative writing  
- 🔍 **Q&A Systems**: Knowledge bases, FAQ automation
- 🌐 **Multi-language**: Translation và cross-language tasks
- 💼 **Business**: Report generation, email drafting
- 🎓 **Education**: Tutoring, explanation generation

## 🔧 Chat Format

Model sử dụng Qwen chat template:

```
<|im_start|>user
Your question here
<|im_end|>
<|im_start|>assistant
AI response here
<|im_end|>
```

## ⚠️ Important Notes

- Model hoạt động tốt nhất với **temperature 0.7-0.8**
- Sử dụng **stop tokens** `["<|im_end|>"]` để tránh over-generation
- Với câu hỏi tiếng Việt, model cho kết quả rất tự nhiên
- **Verified compatibility** với TGI container

## 🏆 Performance

- ✅ **Inference Endpoints**: Tested and verified working
- ⚡ **Speed**: ~20-50 tokens/second on GPU small
- 🎯 **Accuracy**: Excellent cho Vietnamese và English
- 💾 **Memory**: ~6GB VRAM for inference

---

## 📞 Support

- 🐛 **Issues**: Report tại GitHub issues
- 📚 **Docs**: Xem Qwen2.5 documentation  
- 💬 **Community**: HuggingFace discussions

**🎉 Ready for production deployment!**