
RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

[Figure: RLinf overview]

Model Description

The RLinf-math series is trained from the DeepSeek-R1-Distill-Qwen base models (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art results on the benchmarks reported below.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
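For intuition, the sketch below illustrates this objective: rewards are normalized within each group of responses to the same prompt, combined with a PPO-style clipped surrogate, and the loss is aggregated over all tokens in the group rather than averaged per sequence first. This is a minimal, illustrative sketch; function names and tensor shapes are assumptions, not RLinf's actual API.

import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Illustrative GRPO loss with token-level aggregation.

    logprobs, old_logprobs: (G, T) per-token log-probs for a group of G
    responses to the same prompt; mask: (G, T), 1 for valid response tokens;
    rewards: (G,) scalar reward per response.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1)  # broadcast the same advantage to every token

    # PPO-style clipped surrogate on the per-token importance ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)

    # Token-level aggregation: average over all valid tokens in the group.
    return (per_token * mask).sum() / mask.sum()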

Evaluation and Results

We trained and evaluated two models using RLinf:

  • RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B)

    • Recommended sampling settings: temperature = 0.6, top_p = 0.95
  • RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B)

    • Recommended sampling settings: temperature = 1.0, top_p = 0.95
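If you load the checkpoints with Hugging Face transformers, one way to pin these recommended settings is a per-model GenerationConfig, as in the sketch below. The 1.5B repository id is an assumption that mirrors the 7B naming.

from transformers import GenerationConfig

# Recommended sampling settings per model (1.5B repo id is assumed).
RECOMMENDED_SAMPLING = {
    "RLinf/RLinf-math-1.5B": GenerationConfig(do_sample=True, temperature=0.6, top_p=0.95),
    "RLinf/RLinf-math-7B": GenerationConfig(do_sample=True, temperature=1.0, top_p=0.95),
}

# Usage: model.generate(**inputs, generation_config=RECOMMENDED_SAMPLING[model_name])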

Benchmark Results

1.5B models. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 28.33 | 24.90 | 27.45 | 26.89 |
| DeepMath-1.5B | 37.80 | 30.42 | 32.11 | 33.44 |
| DeepScaleR-1.5B-Preview | 40.41 | 30.93 | 27.54 | 32.96 |
| AReaL-1.5B-Preview-Stage-3 | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| FastCuRL-1.5B-V3 | 43.65 | 32.49 | 35.00 | 37.05 |
| RLinf-math-1.5B | 48.44 | 35.63 | 38.46 | 40.84 |

* We retrained the model using the default settings for 600 steps.

7B models. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 54.90 | 40.20 | 45.48 | 46.86 |
| AReaL-boba-RL-7B | 61.66 | 49.38 | 46.93 | 52.66 |
| Skywork-OR1-7B | 66.87 | 52.49 | 44.43 | 54.60 |
| Polaris-7B-Preview | 68.55 | 51.24 | 43.88 | 54.56 |
| AceMath-RL-Nemotron-7B | 67.30 | 55.00 | 45.57 | 55.96 |
| RLinf-math-7B | 68.33 | 52.19 | 48.18 | 56.23 |

How to Use

Example with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model (use 0.6 for the 1.5B model)
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
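If you prefer to serve the model with vLLM (a common choice for long chain-of-thought generation), the same sampling settings apply. A minimal sketch, assuming a default vLLM installation:

from vllm import LLM, SamplingParams

llm = LLM(model="RLinf/RLinf-math-7B")
# 7B settings; use temperature=0.6 for the 1.5B model
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Solve: If x^2 + 2x + 1 = 0, what is x?"], params)
print(outputs[0].outputs[0].text)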

License

This code repository and the model weights are licensed under the MIT License.
