DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Abstract
Our primary goal is to create a good generalist perception model that can tackle multiple tasks within limits on computational resources and training data. To achieve this, we build on text-to-image diffusion models pre-trained on billions of images. Extensive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models, so DICEPTION can be trained at a cost orders of magnitude lower than that of conventional models trained from scratch. Adapting the model to a new task requires fine-tuning on as few as 50 images and about 1% of its parameters. DICEPTION offers valuable insights and a promising path toward visual generalist models.
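The color-encoding idea in the abstract, where each segmented instance is assigned a random color so that segmentation can be posed as ordinary RGB image generation, can be illustrated with a minimal sketch. The function name and mask layout below are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def encode_instances_as_colors(instance_masks, seed=None):
    """Render binary instance masks (N, H, W) into one RGB image (H, W, 3)
    by assigning each instance a random color; unassigned pixels stay black.
    This is a hypothetical sketch of the random-color target encoding."""
    rng = np.random.default_rng(seed)
    n, h, w = instance_masks.shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    # One random RGB color per instance; randomness removes any fixed
    # class-to-color mapping, so the same scheme covers entity and
    # semantic segmentation alike.
    colors = rng.integers(0, 256, size=(n, 3), dtype=np.uint8)
    for mask, color in zip(instance_masks, colors):
        canvas[mask.astype(bool)] = color
    return canvas
```

A diffusion model trained to generate such color maps conditioned on the input image then only has to separate instances consistently, not reproduce any particular palette.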
Community
Arxiv: https://arxiv.org/abs/2502.17157
Project page: https://aim-uofa.github.io/Diception/
HuggingFace Gradio space: https://huggingface.co/spaces/Canyu/Diception-Demo
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- Decoder-Only LLMs are Better Controllers for Diffusion Models (2025)
- IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models (2025)
- Dual Diffusion for Unified Image Generation and Understanding (2024)
- End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings (2025)
- ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval (2025)
- Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models (2025)
- A Survey on Pre-Trained Diffusion Model Distillations (2025)