Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Abstract
Video data provides commonsense priors that enhance 3D asset generation, enabling spatially consistent and semantically plausible 3D content creation.
Scaling laws have validated the success and promise of large-data-trained models for creative generation in the text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, where far less data is available on the internet than in the aforementioned modalities. Fortunately, abundant videos exist that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial-consistency prior for 3D generation. On the other hand, the rich semantic information contained in videos enables the generated content to be more faithful to text prompts and semantically plausible. This paper explores how to apply the video modality to 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view-level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that commonsense priors from videos significantly facilitate 3D creation. We have open-sourced all resources, including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
Community
We propose a novel method to enhance 3D content generation by leveraging video generation technology. To this end, we constructed a large-scale multi-view 3D dataset, Droplet3D-4M, based on Objaverse-XL, and then trained a corresponding 3D generative model on top of the DropletVideo video generation backbone. Our technical solution, model weights, and dataset are fully open-sourced at the links below (a minimal download sketch follows the list).
Paper: https://www.arxiv.org/abs/2508.20470
GitHub: https://github.com/IEIT-AGI/Droplet3D
Project: https://dropletx.github.io/
Model weights: https://huggingface.co/DropletX/Droplet3D-5B
Droplet3D-4M: https://huggingface.co/datasets/DropletX/Droplet3D-4M
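For readers who want to pull the released artifacts locally, the sketch below shows one way to do so with the `huggingface_hub` client. This is only an assumption about access (the two Hub repositories listed above are treated as standard model and dataset repos); it is not the official Droplet3D inference pipeline, which is provided in the GitHub repository.

```python
# Minimal access sketch, assuming the artifacts above are standard Hugging Face
# Hub repositories. This does not reproduce the Droplet3D inference code.
from huggingface_hub import snapshot_download

# Fetch the Droplet3D-5B model weights (listed under "Model weights" above).
model_dir = snapshot_download(repo_id="DropletX/Droplet3D-5B")

# Fetch the Droplet3D-4M multi-view video dataset (note repo_type="dataset").
# The full dataset is large; allow_patterns can restrict the download to a subset.
data_dir = snapshot_download(repo_id="DropletX/Droplet3D-4M", repo_type="dataset")

print("model weights at:", model_dir)
print("dataset at:", data_dir)
```

The downloaded weights can then be loaded by the inference scripts in the GitHub repository linked above.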
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SeqTex: Generate Mesh Textures in Video Sequence (2025)
- Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity (2025)
- ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models (2025)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy (2025)
- Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation (2025)
- HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels (2025)
- Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Models citing this paper: 1
Datasets citing this paper: 1