Collaborative Multi-Modal Coding for High-Quality 3D Generation
Abstract
TriMM, a feed-forward 3D-native generative model, integrates multi-modal data (RGB, RGBD, point clouds) to enhance 3D asset generation quality and robustness.
3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, overlooking the complementary benefits of multi-modal data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of the multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both texture and geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets while using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.
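To make the pipeline described above concrete, the sketch below shows one plausible way to wire modality-specific encoders (RGB, RGBD, point clouds) into a shared triplane latent code, over which a latent diffusion model would subsequently be trained. This is a minimal illustrative sketch: the module names, feature dimensions, and the simple average-based fusion are assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch (assumed design, not the authors' code): modality-specific encoders
# feed a collaborative coder that emits a triplane latent; a latent diffusion model
# would then be trained to denoise these triplane latents.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality (RGB, RGBD, or point cloud) into a shared-width feature."""

    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class CollaborativeMultiModalCoder(nn.Module):
    """Fuses available modality features into a triplane latent code (hypothetical fusion)."""

    def __init__(self, latent_dim: int = 256, plane_res: int = 32, plane_ch: int = 8):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "rgb": ModalityEncoder(3 * 16 * 16, latent_dim),    # flattened RGB patches
            "rgbd": ModalityEncoder(4 * 16 * 16, latent_dim),   # flattened RGBD patches
            "points": ModalityEncoder(3, latent_dim),           # per-point xyz coordinates
        })
        # Project the fused feature onto three latent feature planes (XY, XZ, YZ).
        self.to_triplane = nn.Linear(latent_dim, 3 * plane_ch * plane_res * plane_res)
        self.plane_res, self.plane_ch = plane_res, plane_ch

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = []
        for name, enc in self.encoders.items():
            if name in inputs:                                   # modalities may be missing
                feats.append(enc(inputs[name]).mean(dim=1))      # pool tokens per modality
        fused = torch.stack(feats, dim=0).mean(dim=0)            # simple average fusion (assumed)
        planes = self.to_triplane(fused)
        return planes.view(-1, 3, self.plane_ch, self.plane_res, self.plane_res)


if __name__ == "__main__":
    coder = CollaborativeMultiModalCoder()
    batch = {
        "rgb": torch.randn(2, 64, 3 * 16 * 16),   # 64 image patches per asset
        "points": torch.randn(2, 1024, 3),         # 1024 surface points per asset
    }
    triplane_latent = coder(batch)                 # shape: (2, 3, 8, 32, 32)
    print(triplane_latent.shape)
```

Note that the example deliberately tolerates missing modalities at the input, mirroring the abstract's point that RGB, RGBD, and point-cloud data contribute complementary texture and geometry cues; the auxiliary 2D/3D supervision and the diffusion stage are omitted here.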
Community
3D content inherently possesses multi-modal characteristics and can be represented across different modalities (e.g., RGB images, RGBD, and point clouds). Each modality offers distinct advantages for 3D asset modeling: RGB images capture vivid textures, while point clouds provide fine-grained geometric structures. In this paper, we present TriMM, a feed-forward 3D-native generative model that explores the potential of leveraging multi-modalities for 3D modeling.
The Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors (2025)
- From One to More: Contextual Part Latents for 3D Generation (2025)
- LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion (2025)
- TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP (2025)
- MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation (2025)
- MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image (2025)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy (2025)