RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

Model Description
The RLinf-math series is built on DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art (SOTA) performance on mathematical-reasoning benchmarks.
We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
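Token-level loss aggregation means the clipped policy-gradient loss is normalized by the total number of response tokens in a batch, rather than by averaging per-sequence means, so long chain-of-thought rollouts are not down-weighted. A minimal PyTorch sketch of this idea (illustrative only: the function name, tensor layout, and clipping constant are assumptions, not RLinf's actual implementation):
```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, advantages, response_mask, clip_eps=0.2):
    """Sketch of a GRPO policy loss with token-level aggregation.

    logprobs, old_logprobs: (batch, seq_len) log-probs of the sampled tokens
    advantages:             (batch,) group-relative advantages, one per response
    response_mask:          (batch, seq_len) 1 on response tokens, 0 elsewhere
    """
    ratio = torch.exp(logprobs - old_logprobs)          # per-token importance ratio
    adv = advantages.unsqueeze(-1)                      # broadcast advantage to tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token_loss = -torch.min(unclipped, clipped)     # PPO-style clipped objective
    # Token-level aggregation: sum over all valid tokens and divide by the total
    # token count, instead of averaging per-sequence means.
    return (per_token_loss * response_mask).sum() / response_mask.sum()
```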
Evaluation and Results
We trained and evaluated two models using RLinf:
RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B)
- Recommended sampling settings: temperature = 0.6, top_p = 0.95

RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B)
- Recommended sampling settings: temperature = 1.0, top_p = 0.95
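For reference, these settings can be passed straight to a generation backend. The sketch below assumes vLLM as the serving engine and that the 1.5B checkpoint is published under the repo id RLinf/RLinf-math-1.5B (the 7B id appears in the usage example further down):
```python
from vllm import LLM, SamplingParams

# Recommended decoding settings from the list above; the 1.5B repo id is assumed.
RECOMMENDED = {
    "RLinf/RLinf-math-1.5B": SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192),
    "RLinf/RLinf-math-7B":   SamplingParams(temperature=1.0, top_p=0.95, max_tokens=8192),
}

model_name = "RLinf/RLinf-math-7B"
llm = LLM(model=model_name)
outputs = llm.generate(
    ["Solve: If x^2 + 2x + 1 = 0, what is x?"],
    RECOMMENDED[model_name],
)
print(outputs[0].outputs[0].text)
```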
Benchmark Results
1.5B models. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B with RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 28.33 | 24.90 | 27.45 | 26.89 |
| DeepMath-1.5B | 37.80 | 30.42 | 32.11 | 33.44 |
| DeepScaleR-1.5B-Preview | 40.41 | 30.93 | 27.54 | 32.96 |
| AReaL-1.5B-Preview-Stage-3 | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| FastCuRL-1.5B-V3 | 43.65 | 32.49 | 35.00 | 37.05 |
| RLinf-math-1.5B | 48.44 | 35.63 | 38.46 | 40.84 |

* We retrained the model using the default settings for 600 steps.
7B models. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B with RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 54.90 | 40.20 | 45.48 | 46.86 |
| AReaL-boba-RL-7B | 61.66 | 49.38 | 46.93 | 52.66 |
| Skywork-OR1-7B | 66.87 | 52.49 | 44.43 | 54.60 |
| Polaris-7B-Preview | 68.55 | 51.24 | 43.88 | 54.56 |
| AceMath-RL-Nemotron-7B | 67.30 | 55.00 | 45.57 | 55.96 |
| RLinf-math-7B | 68.33 | 52.19 | 48.18 | 56.23 |
How to Use
Example with Hugging Face transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # required for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
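Because the base model is a DeepSeek-R1-Distill checkpoint, prompting through the tokenizer's chat template is usually preferable for reasoning tasks. A sketch continuing from the snippet above, assuming the RLinf checkpoint keeps the distilled model's chat template:
```python
# Continues from the snippet above: same tokenizer/model, chat-style prompting.
messages = [{"role": "user", "content": prompt}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

chat_outputs = model.generate(
    chat_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,   # recommended for the 7B model
    top_p=0.95,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```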
License
This code repository and the model weights are licensed under the MIT License.