
RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

[Figure: RLinf overview]

Model Description

The RLinf-math series is trained from the DeepSeek-R1-Distill-Qwen base models (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art results on the benchmarks reported below.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
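For intuition, the sketch below illustrates this objective: rewards are normalized within each group of responses to the same prompt, combined with a PPO-style clipped surrogate, and the loss is aggregated over all tokens in the group rather than averaged per sequence first. This is a minimal, illustrative sketch; function names and tensor shapes are assumptions, not RLinf's actual API.

import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Illustrative GRPO loss with token-level aggregation.

    logprobs, old_logprobs: (G, T) per-token log-probs for a group of G
    responses to the same prompt; mask: (G, T), 1 for valid response tokens;
    rewards: (G,) scalar reward per response.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1)  # broadcast the same advantage to every token

    # PPO-style clipped surrogate on the per-token importance ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)

    # Token-level aggregation: average over all valid tokens in the group.
    return (per_token * mask).sum() / mask.sum()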

Evaluation and Results

We trained and evaluated two models using RLinf:

  • RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B)

    • Recommended sampling settings: temperature = 0.6, top_p = 0.95
  • RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B)

    • Recommended sampling settings: temperature = 1.0, top_p = 0.95
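If you load the checkpoints with Hugging Face transformers, one way to pin these recommended settings is a per-model GenerationConfig, as in the sketch below. The 1.5B repository id is an assumption that mirrors the 7B naming.

from transformers import GenerationConfig

# Recommended sampling settings per model (1.5B repo id is assumed).
RECOMMENDED_SAMPLING = {
    "RLinf/RLinf-math-1.5B": GenerationConfig(do_sample=True, temperature=0.6, top_p=0.95),
    "RLinf/RLinf-math-7B": GenerationConfig(do_sample=True, temperature=1.0, top_p=0.95),
}

# Usage: model.generate(**inputs, generation_config=RECOMMENDED_SAMPLING[model_name])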

Benchmark Results

1.5B models. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 28.33 | 24.90 | 27.45 | 26.89 |
| DeepMath-1.5B | 37.80 | 30.42 | 32.11 | 33.44 |
| DeepScaleR-1.5B-Preview | 40.41 | 30.93 | 27.54 | 32.96 |
| AReaL-1.5B-Preview-Stage-3 | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| FastCuRL-1.5B-V3 | 43.65 | 32.49 | 35.00 | 37.05 |
| RLinf-math-1.5B | 48.44 | 35.63 | 38.46 | 40.84 |

* We retrained the model using the default settings for 600 steps.

7B models. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 54.90 | 40.20 | 45.48 | 46.86 |
| AReaL-boba-RL-7B | 61.66 | 49.38 | 46.93 | 52.66 |
| Skywork-OR1-7B | 66.87 | 52.49 | 44.43 | 54.60 |
| Polaris-7B-Preview | 68.55 | 51.24 | 43.88 | 54.56 |
| AceMath-RL-Nemotron-7B | 67.30 | 55.00 | 45.57 | 55.96 |
| RLinf-math-7B | 68.33 | 52.19 | 48.18 | 56.23 |

How to Use

Example with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model (use 0.6 for the 1.5B model)
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
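If you prefer to serve the model with vLLM (a common choice for long chain-of-thought generation), the same sampling settings apply. A minimal sketch, assuming a default vLLM installation:

from vllm import LLM, SamplingParams

llm = LLM(model="RLinf/RLinf-math-7B")
# 7B settings; use temperature=0.6 for the 1.5B model
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Solve: If x^2 + 2x + 1 = 0, what is x?"], params)
print(outputs[0].outputs[0].text)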

License

This code repository and the model weights are licensed under the MIT License.
