R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning
Abstract
R-4B is an auto-thinking multimodal large language model that uses bi-mode annealing and Bi-mode Policy Optimization to decide adaptively when to think before answering, achieving state-of-the-art performance at lower computational cost.
Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and to apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in deciding whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. The model then undergoes a second phase of training under an improved GRPO framework, in which the policy model is forced to generate responses in both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B on most tasks and, on reasoning-intensive benchmarks, matches larger models such as Kimi-VL-A3B-Thinking-2506 (16B) at lower computational cost.
Community
[📄 arXiv Paper] [🤗 Hugging Face] [💻 Code]
We present R-4B, a multimodal large language model designed for general-purpose auto-thinking, autonomously switching between step-by-step thinking and direct response generation based on task complexity. This capability enables R-4B to deliver high-quality responses while significantly improving inference efficiency and reducing computational costs.
The development of R-4B follows a two-stage training paradigm:
(1) Bi-mode Annealing, which establishes both thinking and non-thinking capabilities for VQA; and
(2) Bi-mode Policy Optimization (BPO), which enables the model to adaptively switch between thinking and non-thinking modes based on input demands (a rollout sketch follows).
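To make BPO concrete, the sketch below shows one rollout step under stated assumptions: `policy.generate` and `reward_fn` are hypothetical APIs, and the `<think>`-style mode tags are stand-ins for the model's real control tokens. The paper's exact reward design and loss are not reproduced here; the point is that every query yields rollouts from both modes, with GRPO-style advantages normalized over the pooled group so the two modes compete directly.

```python
# Minimal BPO rollout sketch (illustrative, not the paper's exact recipe).
# Assumptions: `policy.generate` and `reward_fn` are hypothetical, and the
# mode-forcing prefixes below are stand-ins for the model's real tags.
from dataclasses import dataclass
import statistics

THINK_PREFIX = "<think>"             # assumed tag: forces step-by-step thinking
NO_THINK_PREFIX = "<think></think>"  # assumed tag: forces a direct answer

@dataclass
class Rollout:
    mode: str
    text: str
    reward: float

def bpo_rollout_step(policy, reward_fn, query, group_size=4):
    """Sample `group_size` responses per mode, return (rollout, advantage) pairs."""
    rollouts = []
    for mode, prefix in [("thinking", THINK_PREFIX), ("non-thinking", NO_THINK_PREFIX)]:
        for _ in range(group_size):
            text = policy.generate(query, forced_prefix=prefix)  # hypothetical API
            rollouts.append(Rollout(mode, text, reward_fn(query, text)))
    # GRPO-style advantages, normalized over the pooled bi-mode group so that
    # thinking and non-thinking responses are ranked against each other.
    rewards = [r.reward for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r, (r.reward - mean) / std) for r in rollouts]
```

Because both modes are always present in the group, the policy receives a learning signal about when thinking actually pays off, not merely whether an answer is correct.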
🌟 Key Features
🧠 Think Smart, Act Fast: Adaptive & Controllable Thinking!
Our model provides three-mode control over the response process (see the usage sketch after this list).
- Auto-thinking Mode: Unleash auto-thinking that works across general topics, from simple Q&A to complex scientific analysis. It saves time and computation by thinking only when it matters.
- Manual Control: Explicitly command the model to use its `thinking` or `non-thinking` capabilities, giving you full control over every job.
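As a usage illustration, here is a minimal sketch with Transformers. The repo id `YannQi/R-4B` and the `thinking_mode` keyword with values `"auto"`, `"thinking"`, and `"non-thinking"` are assumptions on our part; consult the model card for the exact interface.

```python
# Hypothetical three-mode usage sketch. The repo id and the `thinking_mode`
# keyword (and its values) are assumptions; check the model card for the
# exact interface R-4B exposes.
from transformers import AutoModel, AutoProcessor

MODEL_ID = "YannQi/R-4B"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
for mode in ("auto", "thinking", "non-thinking"):
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        thinking_mode=mode,  # assumed keyword selecting the response mode
    )
    inputs = processor(text=prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    print(f"[{mode}]", processor.decode(output[0], skip_special_tokens=True))
```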
🌟 Strong Performance, Open for Everyone!
Our model is now fully open-source. It achieves state-of-the-art performance among models of comparable size.
📢 News
- [2025.08.20] 🚀 vLLM Support is Here! Our R-4B model is now fully compatible with vLLM for high-performance inference (see the sketch after this list).
- [2025.08.18] 🏆 Top Rank Achieved! We are thrilled to announce that R-4B is now ranked #1 among all open-source models on the OpenCompass Multi-modal Reasoning Leaderboard!
- [2025.08.11] 🥇 Rank #1! R-4B ranks first among models under 20B parameters on the OpenCompass Multi-modal Academic Leaderboard!
- [2025.08.05] 🎉 R-4B is Released! Our model is now publicly available. You can download it from Hugging Face.
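Below is a minimal text-only vLLM sketch. The repo id is assumed to be `YannQi/R-4B`; multimodal inputs and recommended serving flags are covered in the model card.

```python
# Minimal vLLM inference sketch (text-only). The repo id is an assumption;
# see the model card for multimodal inputs and recommended settings.
from vllm import LLM, SamplingParams

llm = LLM(model="YannQi/R-4B", trust_remote_code=True)  # assumed repo id
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain when step-by-step thinking helps."], params)
print(outputs[0].outputs[0].text)
```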