Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
Abstract
X-Planner, a Multimodal Large Language Model-based planning system, uses chain-of-thought reasoning to interpret complex instructions and decompose them into precise edits, achieving state-of-the-art results in instruction-based image editing.
Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clearer sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
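For intuition, below is a minimal sketch of what such a planning loop could look like. The function names, prompt wording, and EditStep fields are illustrative assumptions, not the paper's actual interface; X-Planner itself is a fine-tuned MLLM with its own output format and downstream editing models.

```python
# Illustrative sketch only: a toy planner loop in the spirit of X-Planner.
# The mllm_call interface, prompt, and EditStep fields are assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EditStep:
    sub_instruction: str   # simple, direct instruction (e.g., "make the car red")
    edit_type: str         # e.g., "local_recolor", "object_removal", "insertion"
    mask_prompt: str       # phrase passed to a segmentation model to obtain the edit mask

def plan_edits(image_path: str, instruction: str,
               mllm_call: Callable[[str, str], str]) -> List[EditStep]:
    """Decompose a complex instruction into simple, localized sub-edits."""
    prompt = (
        "Think step by step about the user's request and split it into "
        "independent edits. For each edit, output one line: "
        "sub_instruction | edit_type | mask_prompt.\n"
        f"Request: {instruction}"
    )
    raw = mllm_call(image_path, prompt)  # hypothetical multimodal LLM interface
    steps = []
    for line in raw.strip().splitlines():
        sub, edit_type, mask_prompt = [part.strip() for part in line.split("|")]
        steps.append(EditStep(sub, edit_type, mask_prompt))
    return steps

# Each EditStep would then drive a segmentation model (to produce the mask) and a
# diffusion-based editor, keeping edits local so identity is preserved elsewhere.
```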
Community
An MLLM planner that decomposes complex text-guided image editing instructions into precise sub-instructions with control guidance, ensuring localized, identity-preserving edits.