
Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation

Published on Jan 15, 2025

Abstract

Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process by utilizing a wide range of layouts, such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and quality of controllable image generation, achieving an AP50 of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets across all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple detailed objects in varying sizes, forms, and locations.
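The grounding mechanism adopted from GLIGEN turns each (bounding box, object label) pair into a conditioning token that the denoising network attends to alongside the caption; in GLIGEN these tokens are injected through gated self-attention layers, and ObjectDiffusion pairs this with ControlNet-style conditioning branches. The snippet below is a minimal, hypothetical PyTorch sketch of the grounding-token idea only, fusing a Fourier embedding of the normalized box coordinates with a text embedding of the label. The dimensions, module names, and MLP structure are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of GLIGEN-style grounding tokens: each (bounding box, label)
# pair becomes one conditioning token by fusing Fourier features of the box
# coordinates with a text embedding of the label. Dimensions are illustrative.
import torch
import torch.nn as nn

def fourier_embed(x: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map normalized coordinates in [0, 1] to sin/cos Fourier features."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x.unsqueeze(-1) * freqs                      # (..., 4, num_freqs)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return feats.flatten(start_dim=-2)                    # (..., 4 * 2 * num_freqs)

class GroundingTokenizer(nn.Module):
    """Fuses box-coordinate features with label embeddings into grounding tokens."""
    def __init__(self, text_dim: int = 768, token_dim: int = 768, num_freqs: int = 8):
        super().__init__()
        box_dim = 4 * 2 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + box_dim, token_dim),
            nn.SiLU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, boxes: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
        # boxes: (B, N, 4) normalized xyxy; label_emb: (B, N, text_dim) from a text encoder
        box_feats = fourier_embed(boxes)                              # (B, N, box_dim)
        return self.mlp(torch.cat([label_emb, box_feats], dim=-1))    # (B, N, token_dim)

# Example: two grounded objects in one image
tokens = GroundingTokenizer()(
    boxes=torch.tensor([[[0.1, 0.2, 0.5, 0.8], [0.6, 0.3, 0.9, 0.7]]]),
    label_emb=torch.randn(1, 2, 768),
)
print(tokens.shape)  # torch.Size([1, 2, 768])
```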

Community

In this work, we propose ObjectDiffusion, a cutting-edge model that conditions text-to-image generative diffusion models on semantic and spatial grounding information, enabling precise rendering and placement of objects in specific locations. By integrating the robust architecture of ControlNet with the grounding techniques of GLIGEN, we significantly improve both the precision and quality of controlled image generation. Our model outperforms current state-of-the-art models trained on open-source datasets, achieving notable improvements in FID, AP50, and AR metrics on the COCO2017 dataset. Quantitative and qualitative evaluations demonstrate the capabilities of ObjectDiffusion to synthesize diverse, high-quality, high-fidelity images that consistently align with the specified control layout.
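For readers who want to see how the reported numbers are typically obtained, a minimal sketch follows, assuming FID is computed between real COCO2017 validation images and generated samples, and AP50/AR are obtained by running a pre-trained detector on the generated images and scoring its predictions against the grounding annotations with the standard COCO protocol. The stand-in tensors, file names, and detector choice are placeholders; the paper's exact evaluation pipeline may differ.

```python
# Hedged sketch of the metric computation using torchmetrics and pycocotools.
# Tensors and JSON file names below are hypothetical placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# --- FID: compare real COCO2017-val images with generated samples ---
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)       # stand-in data
generated_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # stand-in data
fid = FrechetInceptionDistance(feature=2048)       # expects uint8 images by default
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())

# --- AP50 / AR: score a detector's predictions on generated images against the
# grounding boxes (requires the two COCO-format JSON files named below) ---
coco_gt = COCO("grounding_annotations.json")                        # hypothetical file
coco_dt = coco_gt.loadRes("detections_on_generated_images.json")    # hypothetical file
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
print("AP50:", evaluator.stats[1], "AR@100:", evaluator.stats[8])
```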

Paper link: https://arxiv.org/abs/2501.09194
