# Qwen2.5 3B Instruct - Production Ready

Verified working with Hugging Face Inference Endpoints!

This is a copy of unsloth/Qwen2.5-3B-Instruct optimized for production deployment. The model has been tested and verified to work correctly with HF Inference Endpoints.
## Features

- Inference Endpoints Ready: verified to work with HF Inference Endpoints
- No Quantization Issues: no quantization problems with TGI
- Production Optimized: ready for production environments
- Vietnamese Excellence: excellent Vietnamese language support
- Multi-language: supports 29+ languages
- High Performance: 3B parameters with strong performance
## Quick Deploy

1-Click Deploy on Inference Endpoints:

- Go to LuvU4ever/qwen2.5-3b-qlora-merged-v2
- Click Deploy → Inference Endpoints
- Choose a GPU [small] instance
- Click Create Endpoint
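If you prefer to script the deployment instead of clicking through the UI, the `huggingface_hub` library provides a `create_inference_endpoint` helper. The sketch below is only an illustration: the vendor, region, and instance values are placeholders, so check the options your Endpoints console actually offers before running it.

```python
from huggingface_hub import create_inference_endpoint

# Programmatic alternative to the 1-click deploy above.
# NOTE: vendor/region/instance values are placeholders - replace them
# with the options listed in your own Inference Endpoints console.
endpoint = create_inference_endpoint(
    name="qwen25-3b-production",
    repository="LuvU4ever/qwen2.5-3b-qlora-merged-v2",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="small",
    instance_type="nvidia-t4",
)

endpoint.wait()      # block until the endpoint reports "running"
print(endpoint.url)  # use this URL as API_URL in the examples below
```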
## Usage

### Local Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "LuvU4ever/qwen2.5-3b-qlora-merged-v2",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v2")

# Chat with the model (Vietnamese: "Hello! What can you help me with?")
messages = [
    {"role": "user", "content": "Xin chào! Bạn có thể giúp tôi gì?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
```
### API Usage (Inference Endpoints)
```python
import requests

# API configuration (values come from your Inference Endpoint)
API_URL = "YOUR_ENDPOINT_URL"
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN",
    "Content-Type": "application/json"
}

def chat_with_model(message, max_tokens=200):
    # Build the prompt in the Qwen chat format and call the TGI endpoint
    payload = {
        "inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "do_sample": True,
            "stop": ["<|im_end|>"],
            "return_full_text": False
        }
    }
    response = requests.post(API_URL, headers=headers, json=payload)
    if response.status_code == 200:
        result = response.json()
        return result[0]["generated_text"].strip()
    else:
        return f"Error: {response.status_code} - {response.text}"

# Usage (Vietnamese: "What traditional dishes does Vietnam have?")
response = chat_with_model("Việt Nam có những món ăn truyền thống nào?")
print(response)
```
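If you already use `huggingface_hub`, its `InferenceClient` can call the same TGI endpoint without hand-rolling the HTTP request. This is a minimal sketch assuming the same placeholder endpoint URL and token as above.

```python
from huggingface_hub import InferenceClient

# Same request as chat_with_model(), but through InferenceClient.
# YOUR_ENDPOINT_URL / YOUR_HF_TOKEN are the placeholders from above.
client = InferenceClient(model="YOUR_ENDPOINT_URL", token="YOUR_HF_TOKEN")

prompt = (
    "<|im_start|>user\n"
    "Việt Nam có những món ăn truyền thống nào?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
answer = client.text_generation(
    prompt,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
    stop_sequences=["<|im_end|>"],
)
print(answer.strip())
```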
### Batch Processing
```python
def batch_chat(messages_list):
    # Send each question to the endpoint and collect question/answer pairs
    results = []
    for msg in messages_list:
        response = chat_with_model(msg)
        results.append({"question": msg, "answer": response})
    return results

# Example (Vietnamese: "What is special about Hanoi?", "How do you cook beef pho?",
# "What is interesting about Vietnamese history?")
questions = [
    "Hà Nội có gì đặc biệt?",
    "Cách nấu phở bò?",
    "Lịch sử Việt Nam có gì thú vị?"
]
results = batch_chat(questions)
for item in results:
    print(f"Q: {item['question']}")
    print(f"A: {item['answer']}\n")
```
## Specifications

| Spec | Value |
|---|---|
| Model Size | ~3B parameters |
| Architecture | Qwen2.5 |
| Context Length | 32,768 tokens |
| Languages | 29+ languages |
| Deployment | HF Inference Endpoints |
| Format | Safetensors |
| License | Apache 2.0 |
## Use Cases

- Chatbots: customer service, virtual assistants
- Content Generation: blog posts, articles, creative writing
- Q&A Systems: knowledge bases, FAQ automation
- Multi-language: translation and cross-language tasks
- Business: report generation, email drafting
- Education: tutoring, explanation generation
## Chat Format

The model uses the Qwen chat template:
```
<|im_start|>user
Your question here
<|im_end|>
<|im_start|>assistant
AI response here
<|im_end|>
```
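For local inference you normally do not build this string by hand; `apply_chat_template` on the tokenizer loaded earlier produces it for you:

```python
# Quick check that the tokenizer emits the Qwen chat format shown above
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the opening assistant turn
)
print(prompt)  # expect the <|im_start|>/<|im_end|> markers shown above
```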
## Important Notes

- The model works best with temperature 0.7-0.8
- Use the stop tokens `["<|im_end|>"]` to avoid over-generation
- With Vietnamese questions, the model produces very natural results
- Verified compatibility with the TGI container
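Put together, the notes above correspond to a parameter block like this (a sketch mirroring the payload used in the API example):

```python
# Recommended generation settings based on the notes above
generation_parameters = {
    "temperature": 0.75,       # the 0.7-0.8 range works best
    "do_sample": True,
    "max_new_tokens": 512,
    "stop": ["<|im_end|>"],    # prevents over-generation
    "return_full_text": False,
}
```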
## Performance

- Inference Endpoints: tested and verified working
- Speed: ~20-50 tokens/second on a GPU [small] instance
- Accuracy: excellent for Vietnamese and English
- Memory: ~6GB VRAM for inference
## Support

- Issues: report via GitHub issues
- Docs: see the Qwen2.5 documentation
- Community: Hugging Face discussions
**Ready for production deployment!**