|
--- |
|
base_model: Qwen/Qwen2.5-3B-Instruct |
|
library_name: transformers |
|
model_name: qwen-2.5-3b-r1-countdown |
|
tags: |
|
- generated_from_trainer |
|
- trl |
|
- grpo |
|
- r1 |
|
- rl |
|
licence: qwen-research |
|
--- |
|
|
|
# Model Card for `qwen-2.5-3b-r1-countdown` a mini R1 experiments |
|
|
|
This model is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct). |
|
It has been trained using [TRL](https://github.com/huggingface/trl) and GRPO on the Countdown game. |
|
|
|
If you want to learn how to replicate this model and reproduce your own Deepseek R1 "aha" moment, check out my [blog post](https://www.philschmid.com/mini-deepseek-r1). |
|
|
|
|
|
## Quick start |
|
|
|
```python |
|
from vllm import LLM, SamplingParams |
|
from datasets import load_dataset |
|
from random import randint |
|
|
|
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512) |
|
|
|
# use revision without "checkpoints-" as vLLM downloads all of them |
|
llm = LLM(model="philschmid/qwen-2.5-3b-r1-countdown", revision="099c0f8cbfc522e7c3a476edfb749f576b164539") |
|
|
|
# Load dataset from Hugging Face Hub |
|
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4" |
|
dataset = load_dataset(dataset_id, split="train") |
|
sample = dataset[randint(0, len(dataset))] |
|
|
|
# create conversation |
|
messages = [ |
|
{"role": "system", "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."}, |
|
{"role": "user", "content": f"Using the numbers {sample['nums']}, create an equation that equals {sample['target']}. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>. Think step by step inside <think> tags."}, |
|
{"role": "assistant", "content": "Let me solve this step by step.\n<think>"} |
|
] |
|
# generate response |
|
res = llm.generate(llm.get_tokenizer().apply_chat_template(messages, tokenize=False, continue_final_message=True), sampling_params) |
|
res = "<think>" + res[0].outputs[0].text |
|
print(res) |
|
|
|
# <think> We need to use the numbers 37, 15, 4, and 13 with basic arithmetic operations to make 16. Let's try different combinations: |
|
# - 37 - 15 - 4 - 13 = 6 (too low) |
|
# - 37 - 15 + 4 - 13 = 13 (too low) |
|
# - 37 + 15 - 4 - 13 = 35 (too high) |
|
# - 37 - 15 + 4 + 13 = 39 (too high) |
|
# - 15 + 4 + 13 - 37 = -1 (too low) |
|
# - 37 + 15 + 4 - 13 = 43 (too high) |
|
# - 15 + 4 * 13 / 37 = 15 + 52 / 37 (not an integer) |
|
# - 15 * 4 / 37 - 37 = -28.24 (not a whole number) |
|
# - 4 * 13 / 15 - 37 = 41.3333 (not a whole number) |
|
# After all combinations, I got not any integer result as 16. |
|
# </think> |
|
# <answer> 37 - 15 + 4 + 13 </answer> |
|
``` |
|
|
|
## Training procedure |
|
|
|
This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300). |
|
|
|
### Framework versions |
|
|
|
- TRL: 0.14.0 |
|
- Transformers: 4.48.1 |
|
- Pytorch: 2.5.1+cu121 |
|
- Datasets: 3.1.0 |
|
- Tokenizers: 0.21.0 |
|
|