---
base_model: agentica-org/DeepScaleR-1.5B-Preview
language:
- en
library_name: transformers
license: mit
tags:
- deepseek
- unsloth
- transformers
- qwen
---
<div>
<p style="margin-bottom: 0; margin-top: 0;">
<strong>See <a href="https://huggingface.co/collections/unsloth/deepseek-r1-all-versions-678e1c48f5d2fce87892ace5">our collection</a> for versions of DeepSeek-R1 including GGUF & 4-bit formats.</strong>
</p>
<p style="margin-bottom: 0;">
<em>Unsloth's DeepSeek-R1 <a href="https://unsloth.ai/blog/deepseekr1-dynamic">1.58-bit + 2-bit Dynamic Quants</a> are selectively quantized, greatly improving accuracy over standard 1-bit/2-bit quantization.</em>
</p>
<div style="display: flex; gap: 5px; align-items: center; ">
<a href="https://github.com/unslothai/unsloth/">
<img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
</a>
<a href="https://discord.gg/unsloth">
<img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
</a>
<a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-r1-on-your-own-local-device">
<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
</a>
</div>
<h1 style="margin-top: 0rem;">Finetune your own reasoning model like R1 with Unsloth!</h1>
</div>

We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb

## ✨ Finetune for Free

All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model that can be exported to GGUF, run in vLLM, or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|------------|
| **GRPO with Phi-4 (14B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4_(14B)-GRPO.ipynb) | 2x faster | 80% less |
| **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
| **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
| **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
| **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
| **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
| **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
| **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
| **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)

- This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
- This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
- \* Kaggle has 2x T4 GPUs, but we use 1; due to overhead, 1x T4 is 5x faster.

<div align="center">
<span style="font-family: default; font-size: 1.5em;">DeepScaleR-1.5B-Preview</span>
<div>
🚀 Democratizing Reinforcement Learning for LLMs 🌟
</div>
</div>
<br>
<div align="center" style="line-height: 1;">
<a href="https://github.com/agentica-project/deepscaler" style="margin: 2px;">
<img alt="Code" src="https://img.shields.io/badge/DeepScaleR-000000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2" target="_blank" style="margin: 2px;">
<img alt="Blog" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://x.com/Agentica_/status/1889006266661617779" style="margin: 2px;">
<img alt="X.ai" src="https://img.shields.io/badge/Agentica-white?style=for-the-badge&logo=X&logoColor=000&color=000&labelColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/agentica-org" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/Agentica-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>

## DeepScaleR Overview
DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using distributed reinforcement learning (RL) to scale to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, a 14.3-point absolute improvement over the base model (28.8%), surpassing OpenAI's O1-Preview with just 1.5B parameters.

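As a quick sanity check, the model can be loaded with the standard `transformers` API (a minimal sketch; the prompt and sampling settings are illustrative, not official recommendations):

```python
# Minimal sketch: load and query the model with the standard transformers
# API. The prompt and sampling settings here are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agentica-org/DeepScaleR-1.5B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 23?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
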
## Data
Our training dataset consists of approximately 40,000 unique problem-answer pairs compiled from:
- AIME problems (1984-2023)
- AMC problems (prior to 2023)
- The Omni-MATH dataset
- The Still dataset

## Training Recipe
We employ DeepSeek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that extends PPO by:
- Normalizing the advantage over all samples generated from the same prompt (a sketch of this computation follows the list).
- Applying KL-divergence regularization on top of PPO's surrogate loss to prevent significant policy drift.

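For illustration, here is a minimal sketch (ours, not the authors' training code) of the group-relative advantage at the heart of GRPO:

```python
# Illustrative sketch of GRPO's group-relative advantage (not the authors'
# code): rewards for the G samples drawn from one prompt are standardized
# against that group's own mean/std, so no learned value critic is needed.
# The KL-divergence penalty mentioned above is added separately in the loss.
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """group_rewards: shape (G,), one scalar reward per sampled completion."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 8 completions for one prompt, 3 correct under the binary reward.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])))
```
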
**Reward Function**: Our reward function is simple but effective (a toy version is sketched after this list):
- 1 for correct answers that pass LaTeX/Sympy equivalence checks
- 0 for incorrect or improperly formatted answers
- Note: no partial rewards (such as from PRMs) and no intermediate feedback.

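A hypothetical sketch of such a checker (the actual reward code handles far more LaTeX formatting edge cases):

```python
# Hypothetical sketch of the binary reward (not the actual checker): 1 only
# if the model's answer is symbolically equal to the reference, else 0.
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def binary_reward(model_answer: str, reference: str) -> int:
    try:
        return int(simplify(parse_latex(model_answer) - parse_latex(reference)) == 0)
    except Exception:  # unparsable or badly formatted answers earn nothing
        return 0

print(binary_reward(r"\frac{2}{4}", r"\frac{1}{2}"))  # 1
print(binary_reward(r"0.49", r"\frac{1}{2}"))         # 0
```
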
**Iterative Context Lengthening**: A key challenge in scaling RL for reasoning is compute cost. Our approach trains models with progressively longer contexts as the model improves, saving both monetary cost and end-to-end training time (the schedule is summarized in the sketch after this list):
- Initial 8K context (steps 0-1040):
  - 22.9% -> 33% Pass@1 on AIME 2024
  - Trained on 8 A100-80GB GPUs; batch size = (prompts) * (samples/prompt) = 128 * 8 = 1024
- Extended to 16K (steps 1040-1520):
  - 33% -> 38% Pass@1 on AIME 2024
  - Trained on 32 A100-80GB GPUs; batch size = (prompts) * (samples/prompt) = 128 * 16 = 2048
- Further extended to 24K (step 1520+):
  - 38% -> 43% Pass@1 on AIME 2024
  - Trained on 32 A100-80GB GPUs; batch size = (prompts) * (samples/prompt) = 128 * 16 = 2048
  - Significant improvements within <200 steps

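The schedule above can be summarized as data (a hypothetical encoding; names, structure, and exact token caps are ours, not the actual training configuration):

```python
# Hypothetical encoding of the context-lengthening schedule described above.
CONTEXT_SCHEDULE = [
    # (start_step, end_step, max_context, gpus, batch = prompts * samples)
    (0,    1040, 8_192,  8,  128 * 8),   # 8K stage:  22.9% -> 33% AIME Pass@1
    (1040, 1520, 16_384, 32, 128 * 16),  # 16K stage: 33% -> 38%
    (1520, None, 24_576, 32, 128 * 16),  # 24K stage: 38% -> 43.1%
]

def max_context_for_step(step: int) -> int:
    """Return the context cap in effect at a given training step."""
    for _, end, ctx, _, _ in CONTEXT_SCHEDULE:
        if end is None or step < end:
            return ctx
    return CONTEXT_SCHEDULE[-1][2]
```
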
A more detailed description of the training recipe can be found in our [blog post](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2).

## Evaluation
We report Pass@1 accuracy averaged over 16 samples for each problem.
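Concretely, this estimator is the per-problem fraction of correct samples, averaged across problems (a minimal sketch with hypothetical names):

```python
# Sketch of the reported metric: Pass@1 estimated as the fraction of the
# 16 samples per problem that are correct, then averaged across problems.
def pass_at_1(correct_flags: list[list[bool]]) -> float:
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_problem) / len(per_problem)

# Two problems, 16 samples each: 7/16 correct and 16/16 correct.
print(pass_at_1([[True] * 7 + [False] * 9, [True] * 16]))  # 0.71875
```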

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
|-------|-----------|----------|----------|--------------|---------------|------|
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | <strong>39.7</strong> | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| <strong>DeepScaleR-1.5B-Preview</strong> | <strong>43.1</strong> | <strong>87.8</strong> | <strong>73.6</strong> | 30.2 | <strong>50.0</strong> | <strong>57.0</strong> |
| O1-Preview | 40.0 | 81.4 | - | - | - | - |

## Serving DeepScaleR
Our model can be served with popular high-performance inference systems:
- vLLM
- Hugging Face Text Generation Inference (TGI)
- SGLang
- TensorRT-LLM

All of these systems support the OpenAI Chat Completions API format; an example request is sketched below.

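For example, with vLLM the model can be queried through the OpenAI-compatible endpoint (a sketch; the serve command, port, and sampling settings are illustrative):

```python
# Sketch: query a locally served DeepScaleR through the OpenAI-compatible
# Chat Completions API. Assumes the server was started with, for example:
#   vllm serve agentica-org/DeepScaleR-1.5B-Preview --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="agentica-org/DeepScaleR-1.5B-Preview",
    messages=[{"role": "user", "content": "Prove that the sum of two odd numbers is even."}],
    temperature=0.6,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
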
## License
This project is released under the MIT License, reflecting our commitment to open and accessible AI development. This permissive license lets researchers, developers, and enthusiasts worldwide freely use, modify, and build upon our work, fostering innovation and collaboration in the AI community.

## Acknowledgement
- Our training experiments are powered by our heavily modified fork of [Verl](https://github.com/agentica-project/verl), an open-source RLHF library.
- Our model is trained on top of [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Our work is done as part of the [Berkeley Sky Computing Lab](https://skycomputing.berkeley.edu/) and [Berkeley AI Research](https://bair.berkeley.edu/).

## Citation
```bibtex
@misc{deepscaler2025,
  title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},
  author={Michael Luo and Sijun Tan and Justin Wong and Xiaoxiang Shi and William Tang and Manan Roongta and Colin Cai and Jeffrey Luo and Tianjun Zhang and Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2}},
  note={Notion Blog}
}
```