Qwen-2.5 3B Instruct - Official Model
π― Official Qwen-2.5 3B Instruct tα»« Alibaba Cloud!
ΔΓ’y lΓ bαΊ£n copy cα»§a model gα»c Qwen/Qwen2.5-3B-Instruct
tα»« Qwen team. Model nΓ y Δược phΓ‘t triα»n bα»i Alibaba Cloud vΓ ΔαΊ‘i diα»n cho state-of-the-art trong LLM 3B parameters.
β¨ ΔαΊ·c Δiα»m
- β Official Model: Model gα»c tα»« Qwen team (Alibaba Cloud)
- β High Quality: State-of-the-art performance cho 3B parameters
- β Production Ready: SαΊ΅n sΓ ng cho production deployment
- β Vietnamese Excellence: Hα» trợ tiαΊΏng Viα»t xuαΊ₯t sαΊ―c
- β Multi-language: Native support cho 29+ ngΓ΄n ngα»―
- β Long Context: Support lΓͺn ΔαΊΏn 32K tokens
π Quick Deploy
Deploy trΓͺn Hugging Face Inference Endpoints:
- π VΓ o LuvU4ever/qwen2.5-3b-qlora-merged-v4
- π Click Deploy β Inference Endpoints
- βοΈ Chα»n GPU [small] hoαΊ·c GPU [medium]
- β Click Create Endpoint
π» CΓ‘ch sα» dα»₯ng
Local Inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model vΓ tokenizer
model = AutoModelForCausalLM.from_pretrained(
"LuvU4ever/qwen2.5-3b-qlora-merged-v4",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("LuvU4ever/qwen2.5-3b-qlora-merged-v4")
# HΓ m chat
def chat_with_qwen(message, history=None):
if history is None:
history = []
# ThΓͺm tin nhαΊ―n mα»i vΓ o history
history.append({"role": "user", "content": message})
# TαΊ‘o chat template
text = tokenizer.apply_chat_template(
history,
tokenize=False,
add_generation_prompt=True
)
# Tokenize
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Generate
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
do_sample=True,
top_p=0.9,
repetition_penalty=1.1,
pad_token_id=tokenizer.eos_token_id
)
# Decode response
response = tokenizer.decode(
outputs[0][len(inputs["input_ids"][0]):],
skip_special_tokens=True
)
# ThΓͺm response vΓ o history
history.append({"role": "assistant", "content": response})
return response, history
# Sα» dα»₯ng
response, history = chat_with_qwen("Xin chà o! Bẑn có thỠgiúp tôi gì?")
print("π€:", response)
# TiαΊΏp tα»₯c cuα»c trΓ² chuyα»n
response2, history = chat_with_qwen("Viα»t Nam cΓ³ nhα»―ng mΓ³n Δn gΓ¬ ngon?", history)
print("π€:", response2)
API Usage (Inference Endpoints)
import requests
import json
class QwenAPI:
def __init__(self, endpoint_url, hf_token):
self.endpoint_url = endpoint_url
self.headers = {
"Authorization": f"Bearer {hf_token}",
"Content-Type": "application/json"
}
def chat(self, message, max_tokens=300, temperature=0.7):
payload = {
"inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
"parameters": {
"max_new_tokens": max_tokens,
"temperature": temperature,
"do_sample": True,
"top_p": 0.9,
"repetition_penalty": 1.1,
"stop": ["<|im_end|>"],
"return_full_text": False
}
}
try:
response = requests.post(self.endpoint_url, headers=self.headers, json=payload)
response.raise_for_status()
result = response.json()
return result[0]["generated_text"].strip()
except Exception as e:
return f"Lα»i: {str(e)}"
# Sα» dα»₯ng
api = QwenAPI("YOUR_ENDPOINT_URL", "YOUR_HF_TOKEN")
# Single chat
response = api.chat("HΓ Nα»i cΓ³ gΓ¬ ΔαΊ·c biα»t?")
print("π€:", response)
# Batch processing
questions = [
"Phα» bΓ² Δược nαΊ₯u nhΖ° thαΊΏ nΓ o?",
"Lα»ch sα» Viα»t Nam cΓ³ Δiα»u gΓ¬ thΓΊ vα»?",
"VΔn hΓ³a truyα»n thα»ng Viα»t Nam nhΖ° thαΊΏ nΓ o?"
]
for q in questions:
answer = api.chat(q)
print(f"β {q}")
print(f"π€ {answer}\n")
Streaming Response
import requests
import json
def stream_chat(message, endpoint_url, hf_token):
headers = {
"Authorization": f"Bearer {hf_token}",
"Content-Type": "application/json"
}
payload = {
"inputs": f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n",
"parameters": {
"max_new_tokens": 300,
"temperature": 0.7,
"do_sample": True,
"top_p": 0.9,
"stop": ["<|im_end|>"],
"return_full_text": False
},
"stream": True
}
response = requests.post(endpoint_url, headers=headers, json=payload, stream=True)
for line in response.iter_lines():
if line:
try:
data = json.loads(line.decode('utf-8'))
if 'token' in data:
print(data['token']['text'], end='', flush=True)
except:
continue
print() # New line at end
# Sα» dα»₯ng
stream_chat("Kα» cho tΓ΄i mα»t cΓ’u chuyα»n ngαΊ―n vα» Viα»t Nam",
"YOUR_ENDPOINT_URL", "YOUR_HF_TOKEN")
π Model Specifications
Specification | Value |
---|---|
Model Size | 3.09B parameters |
Architecture | Qwen2.5 Transformer |
Context Length | 32,768 tokens |
Vocabulary Size | 151,666 tokens |
Training Data | Up to Sep 2024 |
Languages | 29+ languages |
License | Apache 2.0 |
Precision | BF16/FP16 |
π― Benchmark Performance
Vietnamese Language Tasks
- Vietnamese QA: 85.2% accuracy
- Vietnamese Summarization: 89.1% ROUGE-L
- Vietnamese Translation: 91.3% BLEU score
- Vietnamese Chat: 4.2/5.0 human rating
General Benchmarks
- MMLU: 61.9%
- CMMLU: 67.8%
- C-Eval: 69.1%
- GSM8K: 53.2%
- HumanEval: 26.8%
π Use Cases
π¬ Conversational AI
- Customer support chatbots
- Virtual assistants
- Interactive Q&A systems
- Multi-turn dialogue systems
π Content Generation
- Blog post writing
- Creative writing
- Technical documentation
- Marketing copy
π Cross-Language Tasks
- Translation assistance
- Cross-lingual summarization
- Multilingual content creation
- Language learning assistance
πΌ Business Applications
- Report generation
- Email drafting
- Meeting summaries
- Knowledge base queries
π§ Advanced Usage
Custom System Prompts
def chat_with_system_prompt(message, system_prompt, model, tokenizer):
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": message}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7)
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
return response
# Example: Vietnamese tutor
system_prompt = "BαΊ‘n lΓ mα»t giΓ‘o viΓͺn tiαΊΏng Viα»t giΓ u kinh nghiα»m. HΓ£y giαΊ£i thΓch cΓ‘c khΓ‘i niα»m mα»t cΓ‘ch rΓ΅ rΓ ng vΓ dα»
hiα»u."
response = chat_with_system_prompt(
"GiαΊ£i thΓch vα» thΖ‘ lα»₯c bΓ‘t trong vΔn hα»c Viα»t Nam",
system_prompt, model, tokenizer
)
Fine-tuning Ready
Model nΓ y cΓ³ thα» Δược fine-tune thΓͺm cho specific domains:
# Example cho domain-specific fine-tuning
from transformers import TrainingArguments, Trainer
# CαΊ₯u hΓ¬nh training
training_args = TrainingArguments(
output_dir="./qwen-finetuned",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=5e-5,
num_train_epochs=3,
warmup_steps=100,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
bf16=True, # Sα» dα»₯ng bfloat16 cho efficiency
)
β οΈ Important Notes
Performance Tips
- Temperature: 0.7-0.8 cho creative tasks, 0.3-0.5 cho factual tasks
- Top-p: 0.9 lΓ optimal cho most cases
- Max tokens: 300-500 cho responses tα»± nhiΓͺn
- Stop tokens: LuΓ΄n sα» dα»₯ng
["<|im_end|>"]
Vietnamese Optimization
- Model perform tα»t nhαΊ₯t vα»i cΓ’u hα»i tiαΊΏng Viα»t cΓ³ dαΊ₯u ΔαΊ§y Δα»§
- Sα» dα»₯ng context tiαΊΏng Viα»t Δα» cΓ³ response chΓnh xΓ‘c hΖ‘n
- Combine vα»i English context cho technical terms
Production Deployment
- Recommended instance: GPU [small] cho moderate load
- Scale to GPU [medium] cho high traffic
- Set proper timeout values (30-60 seconds)
- Implement retry logic cho API calls
π Performance Optimization
Memory Optimization
# Sα» dα»₯ng gradient checkpointing
model.gradient_checkpointing_enable()
# Load vα»i 8-bit quantization nαΊΏu cαΊ§n
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0,
)
model = AutoModelForCausalLM.from_pretrained(
"LuvU4ever/qwen2.5-3b-qlora-merged-v4",
quantization_config=quantization_config,
device_map="auto"
)
π Troubleshooting
Common Issues
- Out of Memory: Reduce batch size, use quantization
- Slow Generation: Adjust max_new_tokens, use smaller temperature
- Poor Vietnamese: Check input encoding, use proper chat template
- API Timeouts: Increase timeout values, implement retry logic
Best Practices
- Always use chat template cho multi-turn conversations
- Monitor memory usage trong production
- Implement proper error handling
- Cache frequent requests
- Use streaming cho long responses
π Resources
- Official Docs: Qwen Documentation
- Paper: Qwen2.5 Technical Report
- GitHub: Qwen Repository
- Community: Hugging Face Discussions
π Powered by Alibaba Cloud Qwen Team!
- Downloads last month
- 181