---
base_model:
- Qwen/Qwen3-8B
---
# Qwen3 AWQ Quantized Model Collection
This repository provides AWQ (Activation-aware Weight Quantization) versions of Qwen3 models, optimized for efficient deployment on consumer hardware while maintaining strong performance.
## Models Available
- Qwen3-32B-AWQ - 4-bit quantized, 32B parameters
- Qwen3-14B-AWQ - 4-bit quantized, 14B parameters
- Qwen3-8B-AWQ - 4-bit quantized, 8B parameters
- Qwen3-4B-AWQ - 4-bit quantized, 4B parameters
## Quantization Details
- Weights: 4-bit precision (AWQ)
- Activations: 16-bit precision
- Benefits:
  - Up to 3x memory reduction vs FP16 (see the rough estimate below)
  - Up to 3x inference speedup on supported hardware
  - Minimal loss in model quality
## Features
- Multilingual: Supports 100+ languages
- Long Context: Native 32K context, extendable with YaRN to 131K tokens
- Efficient Inference: Optimized for NVIDIA GPUs with Tensor Core support
## Usage
### With Hugging Face Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("abhishekchohan/Qwen3-8B-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("abhishekchohan/Qwen3-8B-AWQ")

messages = [{"role": "user", "content": "Explain quantum computing."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### With vLLM
```bash
vllm serve abhishekchohan/Qwen3-8B-AWQ \
  --chat-template templates/chat_template.jinja \
  --tensor-parallel-size 4
```
## Citation
If you use these models, please cite:
```bibtex
@misc{qwen3,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://github.com/QwenLM/Qwen3}
}
```