File size: 1,650 Bytes
f0174f5 6876f06 f0174f5 04c6948 6876f06 f0174f5 6876f06 f0174f5 6876f06 cfafd1b f0174f5 6876f06 f0174f5 6876f06 f0174f5 2a2e831 23a0975 b874665 b886b6d 23a0975 bd5a2e8 23a0975 2a2e831 545e8f7 2a2e831 6876f06 f0174f5 6876f06 6c7c60b 6876f06 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
---
pipeline_tag: text-generation
inference: true
license: apache-2.0
datasets:
- simplescaling/s1K
---
**We recommend using our successor [s1.1](https://huggingface.co/simplescaling/s1.1-32B) with better performance**
# Model Summary
> s1 is a reasoning model finetuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview & exhibits test-time scaling via budget forcing.
- **Repository:** [simplescaling/s1](https://github.com/simplescaling/s1)
- **Paper:** https://arxiv.org/abs/2501.19393
# Use
The model usage is documented [here](https://github.com/simplescaling/s1?tab=readme-ov-file#inference).
# Evaluation
| Metric | s1-32B | s1.1-32B | o1-preview | o1 | DeepSeek-R1 | DeepSeek-R1-Distill-Qwen-32B |
|---|---|---|---|---|---|---|
| # examples | 1K | 1K | ? | ? | >800K | 800K |
| AIME2024 | 56.7 | 56.7 | 40.0 | 74.4 | 79.8 | 72.6 |
| AIME2025 I | 26.7 | 60.0 | 37.5 | ? | 65.0 | 46.1 |
| MATH500 | 93.0 | 95.4 | 81.4 | 94.8 | 97.3 | 94.3 |
| GPQA-Diamond | 59.6 | 63.6 | 75.2 | 77.3 | 71.5 | 62.1 |
Note that s1-32B and s1.1-32B use budget forcing in this table; specifically ignoring end-of-thinking and appending "Wait" once or twice.
# Citation
```bibtex
@misc{muennighoff2025s1simpletesttimescaling,
title={s1: Simple test-time scaling},
author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
year={2025},
eprint={2501.19393},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.19393},
}
``` |