arXiv:2507.00748

Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

Published on Jul 1, 2025

Abstract

AI-generated summary: A reinforcement learning-based post-training strategy improves multimodal large language models' performance in multi-image grounding tasks through chain-of-thought data synthesis, low-rank adaptation, and rule-based reinforcement learning.

Recently, Multimodal Large Language Models (MLLMs) have excelled at visual grounding in single-image scenarios with textual references. However, their performance degrades in real-world applications involving complex multi-image compositions and multimodal instructions, revealing limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks. Our approach begins by synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) with low-rank adaptation (LoRA). This cold-start training stage enables the model to identify correct solutions. We then perform rejection sampling with the merged SFT model to curate high-quality RL data, and apply rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, yielding improvements of +9.04% on MIG-Bench, +6.37% on MC-Bench, and +4.98% on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our method exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on the BLINK and MMIU benchmarks, respectively.
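The pipeline hinges on a verifiable, rule-based reward that is used twice: to filter rollouts from the merged SFT model during rejection sampling, and as the training signal for the RL stage. The abstract does not spell out the implementation, so the following is a minimal sketch under common assumptions for grounding rewards: a `<think>...</think><answer>...</answer>` output template, an IoU check of the predicted bounding box against the ground truth, and a small format bonus. Every function name, threshold, and reward weight here is hypothetical, not the authors' released code.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Assumed CoT output template: reasoning in <think>, final box in <answer>.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def grounding_reward(completion, gt_box, iou_thresh=0.5):
    """Rule-based reward: small format bonus plus binary IoU accuracy."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0                      # malformed output earns nothing
    reward = 0.1                        # format bonus (assumed weight)
    coords = re.findall(r"-?\d+\.?\d*", match.group(1))
    if len(coords) >= 4:
        box = [float(c) for c in coords[:4]]
        if iou(box, gt_box) >= iou_thresh:
            reward += 1.0               # accuracy reward
    return reward

def rejection_sample(examples, rollout_fn, threshold=1.0):
    """Keep only rollouts from the merged SFT model whose reward clears the
    threshold; the survivors become the curated data for the RL stage."""
    kept = []
    for prompt, gt_box in examples:
        for completion in rollout_fn(prompt):   # samples from the SFT model
            if grounding_reward(completion, gt_box) >= threshold:
                kept.append((prompt, completion))
    return kept
```

A binary, rule-based reward of this kind is a common choice for RL post-training on grounding because it is cheap to compute and harder to game than a learned reward model; the same function can score rollouts for both the rejection-sampling filter and the subsequent policy optimization.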
