---
base_model:
- Qwen/Qwen3-8B
---
# Qwen3 AWQ Quantized Model Collection
This repository provides AWQ (Activation-aware Weight Quantization) versions of Qwen3 models, optimized for efficient deployment on consumer hardware while maintaining strong performance.
## Models Available
- Qwen3-32B-AWQ - 4-bit quantized, 32B parameters
- Qwen3-14B-AWQ - 4-bit quantized, 14B parameters
- Qwen3-8B-AWQ - 4-bit quantized, 8B parameters
- Qwen3-4B-AWQ - 4-bit quantized, 4B parameters
## Quantization Details
- Weights: 4-bit precision (AWQ)
- Activations: 16-bit precision
- Benefits:
  - Up to 3x memory reduction vs FP16 (see the rough estimate below)
  - Up to 3x inference speedup on supported hardware
  - Minimal loss in model quality
## Features
- Multilingual: Supports 100+ languages
- Long Context: Native 32K context, extendable with YaRN to 131K tokens
- Efficient Inference: Optimized for NVIDIA GPUs with Tensor Core support
## Usage
### With Hugging Face Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("abhishekchohan/Qwen3-8B-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("abhishekchohan/Qwen3-8B-AWQ")

messages = [{"role": "user", "content": "Explain quantum computing."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### With vLLM
```bash
vllm serve abhishekchohan/Qwen3-8B-AWQ \
  --chat-template templates/chat_template.jinja \
  --tensor-parallel-size 4
```
## Citation
If you use these models, please cite:
```bibtex
@misc{qwen3,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://github.com/QwenLM/Qwen3}
}
```