MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
Abstract
A large-scale dataset and benchmark are introduced to address limitations in multi-reference image generation by providing structured long-context supervision and standardized evaluation protocols.
Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.
Community
We present MACRO: MacroData, a large-scale multi-reference image generation dataset of 400K samples, together with MacroBench, a companion benchmark for multi-image generation. Each sample supports up to 10 reference images and covers four long-context task dimensions: customization, illustration, spatial reasoning, and temporal dynamics. Fine-tuning on the dataset effectively mitigates the performance degradation that current models face when handling multi-reference inputs.
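For concreteness, below is a minimal sketch of how a single MacroData sample could be represented. This is an illustration based only on the description above; the field names (`reference_images`, `instruction`, `target_image`, `task_dimension`) and the schema itself are our assumptions, not the released data format.

```python
# Hypothetical representation of one MacroData sample (assumed schema, not the official release).
from dataclasses import dataclass
from typing import List

TASK_DIMENSIONS = ("customization", "illustration", "spatial", "temporal")

@dataclass
class MacroDataSample:
    sample_id: str
    task_dimension: str            # one of TASK_DIMENSIONS
    reference_images: List[str]    # paths/URLs to 1-10 reference images
    instruction: str               # text prompt tying the references together
    target_image: str              # path/URL to the ground-truth output image

    def __post_init__(self):
        assert self.task_dimension in TASK_DIMENSIONS, "unknown task dimension"
        assert 1 <= len(self.reference_images) <= 10, "up to 10 references per sample"

# Example with illustrative values only:
sample = MacroDataSample(
    sample_id="macro_000001",
    task_dimension="customization",
    reference_images=[f"refs/subject_{i}.png" for i in range(4)],
    instruction="Compose the four subjects into a single park scene.",
    target_image="targets/macro_000001.png",
)
```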
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models (2026)
- SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens (2026)
- OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning (2026)
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation (2026)
- Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning (2026)
- UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing (2026)
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing (2026)
