R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning
Abstract
R-4B is an auto-thinking multimodal large language model that uses bi-mode annealing and Bi-mode Policy Optimization to decide adaptively when to think before answering, achieving state-of-the-art performance at lower computational cost.
Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and to apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in deciding whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. The model then undergoes a second phase of training under an improved GRPO framework, in which the policy model is forced to generate responses in both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B on most tasks and, on reasoning-intensive benchmarks, matches larger models such as Kimi-VL-A3B-Thinking-2506 (16B) at lower computational cost.
Community
[📄 arXiv Paper] [🤗 Hugging Face] [💻 Code]
We present R-4B, a multimodal large language model designed for general-purpose auto-thinking, autonomously switching between step-by-step thinking and direct response generation based on task complexity. This capability enables R-4B to deliver high-quality responses while significantly improving inference efficiency and reducing computational costs.
The development of R-4B follows a two-stage training paradigm:
(1) Bi-mode Annealing, which establishes both thinking and non-thinking capabilities for VQA; and
(2) Bi-mode Policy Optimization (BPO), which enables the model to adaptively switch between thinking and non-thinking modes based on input demands (a rollout sketch follows).
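To make BPO concrete, the sketch below shows one rollout step under stated assumptions: `policy.generate` and `reward_fn` are hypothetical APIs, and the `<think>`-style mode tags are stand-ins for the model's real control tokens. The paper's exact reward design and loss are not reproduced here; the point is that every query yields rollouts from both modes, with GRPO-style advantages normalized over the pooled group so the two modes compete directly.

```python
# Minimal BPO rollout sketch (illustrative, not the paper's exact recipe).
# Assumptions: `policy.generate` and `reward_fn` are hypothetical, and the
# mode-forcing prefixes below are stand-ins for the model's real tags.
from dataclasses import dataclass
import statistics

THINK_PREFIX = "<think>"             # assumed tag: forces step-by-step thinking
NO_THINK_PREFIX = "<think></think>"  # assumed tag: forces a direct answer

@dataclass
class Rollout:
    mode: str
    text: str
    reward: float

def bpo_rollout_step(policy, reward_fn, query, group_size=4):
    """Sample `group_size` responses per mode, return (rollout, advantage) pairs."""
    rollouts = []
    for mode, prefix in [("thinking", THINK_PREFIX), ("non-thinking", NO_THINK_PREFIX)]:
        for _ in range(group_size):
            text = policy.generate(query, forced_prefix=prefix)  # hypothetical API
            rollouts.append(Rollout(mode, text, reward_fn(query, text)))
    # GRPO-style advantages, normalized over the pooled bi-mode group so that
    # thinking and non-thinking responses are ranked against each other.
    rewards = [r.reward for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r, (r.reward - mean) / std) for r in rollouts]
```

Because both modes are always present in the group, the policy receives a learning signal about when thinking actually pays off, not merely whether an answer is correct.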
🌟 Key Features
🧠 Think Smart, Act Fast: Adaptive & Controllable Thinking!
Our model provides three-mode control over the response process (see the usage sketch after this list).
- Auto-thinking Mode: Unleash auto-thinking that works across general topics, from simple Q&A to complex scientific analysis. It saves time and computation by thinking only when it matters.
- Manual Control: Explicitly command the model to use its `thinking` or `non-thinking` capabilities, giving you full control over every job.
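As a usage illustration, here is a minimal sketch with Transformers. The repo id `YannQi/R-4B` and the `thinking_mode` keyword with values `"auto"`, `"thinking"`, and `"non-thinking"` are assumptions on our part; consult the model card for the exact interface.

```python
# Hypothetical three-mode usage sketch. The repo id and the `thinking_mode`
# keyword (and its values) are assumptions; check the model card for the
# exact interface R-4B exposes.
from transformers import AutoModel, AutoProcessor

MODEL_ID = "YannQi/R-4B"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
for mode in ("auto", "thinking", "non-thinking"):
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        thinking_mode=mode,  # assumed keyword selecting the response mode
    )
    inputs = processor(text=prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    print(f"[{mode}]", processor.decode(output[0], skip_special_tokens=True))
```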
🌟 Strong Performance, Open for Everyone!
Our model is now fully open-source. It achieves state-of-the-art performance among models of comparable size.
📢 News
- [2025.08.20] 🚀 vLLM Support is Here! Our R-4B model is now fully compatible with vLLM for high-performance inference (see the sketch after this list).
- [2025.08.18] 🏆 Top Rank Achieved! We are thrilled to announce that R-4B is now ranked #1 among all open-source models on the OpenCompass Multi-modal Reasoning Leaderboard!
- [2025.08.11] 🥇 Rank #1! R-4B ranks first among models under 20B parameters on the OpenCompass Multi-modal Academic Leaderboard!
- [2025.08.05] 🎉 R-4B is Released! Our model is now publicly available. You can download it from Hugging Face.
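Below is a minimal text-only vLLM sketch. The repo id is assumed to be `YannQi/R-4B`; multimodal inputs and recommended serving flags are covered in the model card.

```python
# Minimal vLLM inference sketch (text-only). The repo id is an assumption;
# see the model card for multimodal inputs and recommended settings.
from vllm import LLM, SamplingParams

llm = LLM(model="YannQi/R-4B", trust_remote_code=True)  # assumed repo id
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain when step-by-step thinking helps."], params)
print(outputs[0].outputs[0].text)
```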