---
base_model: Qwen/QwQ-32B
---
This is an FP8-Dynamic quantization of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B).

QwQ-32B, developed by the Qwen Team, is a medium-sized reasoning model in the Qwen series, built for tasks that demand advanced thinking and problem solving. With 32.5 billion parameters, it significantly outperforms conventional instruction-tuned models on complex reasoning tasks. Its transformer architecture incorporates RoPE, SwiGLU, and RMSNorm, and it handles long sequences of up to 131,072 tokens. These properties make QwQ-32B well suited to challenging downstream tasks, such as hard mathematical problems and standardized multiple-choice questions, in environments where sophisticated reasoning is required.
## Evaluations
This model achieves an accuracy recovery of 100.0%, i.e. its average benchmark score matches that of the unquantized original.

| __English__   |   __[QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)__ |   __[QwQ-32B-FP8-Dynamic (this)](https://huggingface.co/cortecs/QwQ-32B-FP8-Dynamic)__ |
|:--------------|-----------------------------------------------------:|---------------------------------------------------------------------------------------:|
| Avg.          |                                                74.05 |                                                                                  74.05 |
| ARC           |                                                72.7  |                                                                                  72.8  |
| Hellaswag     |                                                75.4  |                                                                                  75.3  |
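Accuracy recovery here is the quantized model's average score expressed as a percentage of the baseline's. A minimal sketch of the computation (the helper function is illustrative, not part of any evaluation library):

```
# Accuracy recovery: quantized average score relative to the baseline average.
def accuracy_recovery(baseline_avg: float, quantized_avg: float) -> float:
    return 100.0 * quantized_avg / baseline_avg

# Averages from the table above: 74.05 for both models.
print(f"{accuracy_recovery(74.05, 74.05):.1f}%")  # -> 100.0%
```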

We did not check for data contamination. Evaluation was done using [Eval. Harness](https://github.com/EleutherAI/lm-evaluation-harness) with `limit=1000`.
## Usage
Install **vLLM** and run the [server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#openai-compatible-server):
```
python -m vllm.entrypoints.openai.api_server --model cortecs/QwQ-32B-FP8-Dynamic --max-model-len 131072 --gpu-memory-utilization 0.9
```
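Once the server is up, you can confirm the model is loaded through the OpenAI-compatible `/v1/models` endpoint. A minimal sketch using the `openai` Python client (assuming `pip install openai`; the API key is an arbitrary placeholder since the server above runs without authentication):

```
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; point the client at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server is currently serving.
for model in client.models.list():
    print(model.id)  # expected: cortecs/QwQ-32B-FP8-Dynamic
```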
Access the model:
```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/QwQ-32B-FP8-Dynamic",
        "prompt": "San Francisco is a"
    }'
```
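The same request can be sent from Python with the `openai` client (a sketch; `max_tokens` is an illustrative choice, not a value recommended by this card):

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Text completion against the quantized model, mirroring the curl example.
response = client.completions.create(
    model="cortecs/QwQ-32B-FP8-Dynamic",
    prompt="San Francisco is a",
    max_tokens=64,
)
print(response.choices[0].text)
```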