arxiv:2506.23918

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Published on Jun 30
· Submitted by Warrieryes on Jul 4
#2 Paper of the day
Authors: …, Yan Ma, …

Abstract

The survey outlines the evolution of multimodal AI from treating vision as static context to integrating it dynamically into the reasoning process, highlighting three stages and key challenges.

AI-generated summary

Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the "thinking with images" paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research toward more powerful and human-aligned multimodal AI.

Community

Paper author Paper submitter

This survey provides a foundational framework for the "Think with Image" paradigm, which moves beyond static visual perception to active, multi-step visual reasoning. The survey organizes the field into a three-stage evolution of increasing cognitive autonomy: from leveraging external tools, to programmatically generating visual operations, and finally to performing intrinsic visual imagination. By systematically analyzing the core methodologies, applications, and challenges associated with each stage, this work aims to offer a roadmap for developing the next generation of multimodal AI.

