arxiv:2506.23918

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Published on Jun 30

Submitted by Warrieryes on Jul 4

#2 Paper of the day


Authors: Zhaochen Su, Peng Xia, Yan Ma, Xiaoye Qu, Jiaqi Liu

Abstract

AI-generated summary: The survey outlines the evolution of multimodal AI from treating vision as static context to integrating it dynamically into the reasoning process, highlighting three stages and key challenges.

Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the "think with image" paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.


Community

Warrieryes (paper author, paper submitter), Jul 4

This survey provides a foundational framework for the "Think with Image" paradigm, which moves beyond static visual perception to active, multi-step visual reasoning. The survey organizes the field into a three-stage evolution of increasing cognitive autonomy: from leveraging external tools, to programmatically generating visual operations, and finally to performing intrinsic visual imagination. By systematically analyzing the core methodologies, applications, and challenges associated with each stage, this work aims to offer a roadmap for developing the next generation of multimodal AI.