arXiv:2503.07536

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Published on Mar 10 · Submitted by ColeYzzzz on Mar 12

Abstract

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.

Community


LMM-R1: Enhancing Multimodal Reasoning through Two-Stage Rule-Based RL

TL;DR: We present a two-stage rule-based reinforcement learning approach that significantly improves multimodal reasoning in LMMs, enabling our 3B model to outperform much larger models such as Claude 3.5 Sonnet and GPT-4o on agent tasks.

🔍 Key Insights

We address the challenge of limited high-quality multimodal reasoning data by introducing a two-stage training approach (a minimal code sketch follows the list):

  1. Stage 1: Leverage abundant high-quality text-only reasoning data for RL training, enhancing core reasoning abilities that generalize to multimodal tasks.

  2. Stage 2: Fine-tune with limited multimodal reasoning data to improve cross-modal generalization.
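To make the recipe concrete, here is a minimal sketch in plain Python. The reward follows the rule-based scheme the paper describes (verifiable exact-match answers, no learned reward model); `dataset.sample`, `policy.generate`, and `policy.rl_update` are hypothetical hooks standing in for a real RL trainer such as OpenRLHF, not the authors' actual code.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    # Binary rule-based reward: 1.0 iff the last \boxed{...} answer in the
    # model's response exactly matches the reference, else 0.0.
    answers = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == gold_answer.strip() else 0.0

def train_stage(policy, dataset, num_steps):
    # One RL stage: sample a prompt, roll out a response, score it with the
    # rule-based reward, and apply a policy-gradient update (e.g. a PPO step).
    # `policy` and `dataset` are placeholders, not real library objects.
    for _ in range(num_steps):
        prompt, gold = dataset.sample()
        response = policy.generate(prompt)
        policy.rl_update(prompt, response, rule_based_reward(response, gold))

# Stage 1 (FRE): text-only reasoning data strengthens core reasoning.
#   train_stage(policy, text_only_reasoning_data, num_steps=...)
# Stage 2 (MGT): limited multimodal data generalizes it across modalities.
#   train_stage(policy, multimodal_reasoning_data, num_steps=...)

print(rule_based_reward(r"Therefore the answer is \boxed{42}.", "42"))  # -> 1.0
```

Because the reward is just a string check against a verifiable answer, the text-only stage scales without training any reward model, which is the source of the data efficiency claimed in the abstract.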

This approach yields impressive results across both text-only and multimodal reasoning benchmarks, demonstrating strong generalization capabilities.

📊 Surprising Results

Our 3B-parameter model achieves remarkable performance:

  • Outperforms Gemini 1.5 Pro and Claude 3.5 Sonnet on football game environments
  • Surpasses GPT-4o on Sokoban puzzle-solving after minimal additional RL training

🧪 Experimental Findings

Our ablation studies reveal important insights:

  • Text-only training enhances reasoning but can degrade visual perception
  • Multimodal-only training improves perception but weakens reasoning and reduces output length
  • SFT with distilled responses fails to capture deep reasoning capabilities
  • Mixed data training shows moderate results, suggesting room for optimization

🛠️ Technical Improvements

We've completely refactored our multimodal codebase to support the latest OpenRLHF features:

  • PackingSample implementation (the general packing idea is sketched after this list)
  • Ring FlashAttn integration
  • Significantly reduced memory footprint
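For context, sample packing concatenates several variable-length sequences into one fixed-length buffer so that almost no tokens are wasted on padding. The sketch below is a generic first-fit-decreasing packer in plain Python, included only to illustrate the idea; it is not OpenRLHF's PackingSample implementation, and all names are ours.

```python
from typing import List

def pack_samples(lengths: List[int], max_len: int) -> List[List[int]]:
    # First-fit-decreasing packing: place each sample (longest first) into
    # the first bucket with enough remaining room, so every packed buffer
    # stays within max_len tokens and padding waste is minimized.
    assert all(n <= max_len for n in lengths), "sample longer than buffer"
    buckets: List[List[int]] = []  # packed groups of sample indices
    room: List[int] = []           # remaining token budget per bucket
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b in range(len(buckets)):
            if lengths[idx] <= room[b]:
                buckets[b].append(idx)
                room[b] -= lengths[idx]
                break
        else:
            buckets.append([idx])
            room.append(max_len - lengths[idx])
    return buckets

# Five samples packed into 512-token buffers -> [[1, 0], [3, 4], [2]]
print(pack_samples([100, 400, 250, 300, 60], max_len=512))
```

Ring FlashAttn is complementary: it shards attention over the sequence dimension across GPUs, which is what lets long packed sequences fit in memory.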

📚 Resources

This work demonstrates that smaller models can achieve remarkable reasoning capabilities through strategic training approaches, challenging the notion that scale is the only path to advanced AI capabilities.

