BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Architecture diagram of BIP3D. Red stars mark the parts that were modified or added relative to the base model, GroundingDINO; dashed lines mark optional elements.
Results on EmbodiedScan Benchmark
We made several improvements over the original paper that yield better 3D perception results. The two main changes are:
- New Fusion Operation: In the decoder, we replaced deformable aggregation (DAG) with a 3D deformable attention mechanism (DAT). Feature sampling now uses trilinear interpolation instead of bilinear interpolation, which leverages the depth distribution for more accurate feature extraction.
- Mixed Data Training: To improve the grounding model's performance, we mix detection data with grounding data during grounding fine-tuning.
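The difference between the two sampling schemes can be illustrated with a minimal NumPy sketch. This is not the BIP3D implementation: the function names, the tensor layouts (`feat[C, H, W]`, `depth_prob[D, H, W]`), and the idea of sampling an implicit volume `feat * depth_prob` at a continuous depth-bin index are assumptions made for illustration of how a depth distribution could enter the interpolation.

```python
import numpy as np

def bilinear_sample(feat, u, v):
    """Sample feat[C, H, W] at continuous pixel coords (u along W, v along H)."""
    _, H, W = feat.shape
    u = float(np.clip(u, 0, W - 1))
    v = float(np.clip(v, 0, H - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    # Weighted sum of the four neighbouring pixels.
    return ((1 - du) * (1 - dv) * feat[:, v0, u0]
            + du * (1 - dv) * feat[:, v0, u1]
            + (1 - du) * dv * feat[:, v1, u0]
            + du * dv * feat[:, v1, u1])

def trilinear_sample(feat, depth_prob, u, v, d):
    """Sample the implicit volume feat[c, v, u] * depth_prob[d, v, u] at (u, v, d).

    ASSUMPTION: depth_prob[D, H, W] is a per-pixel distribution over D depth
    bins and d is a continuous bin index. Trilinear interpolation is separable,
    so we interpolate along depth first and then bilinearly in the image plane.
    """
    D = depth_prob.shape[0]
    d = float(np.clip(d, 0, D - 1))
    d0 = int(np.floor(d))
    d1 = min(d0 + 1, D - 1)
    dd = d - d0
    # Depth-interpolated probability map, shape [H, W].
    prob_at_d = (1 - dd) * depth_prob[d0] + dd * depth_prob[d1]
    # Weight features by that probability, then interpolate in (u, v).
    return bilinear_sample(feat * prob_at_d[None], u, v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(8, 16, 16))
    logits = rng.normal(size=(32, 16, 16))
    depth_prob = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    print(bilinear_sample(feat, 3.4, 7.8).shape)                    # (8,)
    print(trilinear_sample(feat, depth_prob, 3.4, 7.8, 12.3).shape)  # (8,)
```

In this sketch, a projected 3D point whose depth disagrees with the predicted depth distribution at that pixel receives a down-weighted feature, whereas plain bilinear sampling returns the same feature regardless of depth.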
1. Results on Multi-view 3D Detection Validation Dataset
Model | Inputs | Op | Overall | Head | Common | Tail | Small | Medium | Large | ScanNet | 3RScan | MP3D | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BIP3D | RGB | DAG | 16.57 | 23.29 | 13.84 | 12.29 | 2.67 | 17.85 | 12.89 | 19.71 | 26.76 | 8.50 | - | - |
BIP3D | RGB | DAT | 16.67 | 22.41 | 14.19 | 13.18 | 3.32 | 17.25 | 14.89 | 20.80 | 24.18 | 9.91 | - | - |
BIP3D | RGB-D | DAG | 22.53 | 28.89 | 20.51 | 17.83 | 6.95 | 24.21 | 15.46 | 24.77 | 35.29 | 10.34 | - | - |
BIP3D | RGB-D | DAT | 23.24 | 31.51 | 20.20 | 17.62 | 7.31 | 24.09 | 15.82 | 26.35 | 36.29 | 11.44 | - | - |
2. Results on Multi-view 3D Grounding Mini Dataset
Model | Inputs | Op | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BIP3D | RGB | DAG | 44.00 | 44.39 | 39.56 | 46.05 | 42.92 | 48.62 | 42.47 | 36.40 | - | - |
BIP3D | RGB | DAT | 44.43 | 44.74 | 41.02 | 45.17 | 44.04 | 49.70 | 41.81 | 37.28 | - | - |
BIP3D | RGB-D | DAG | 45.79 | 46.22 | 40.91 | 45.93 | 45.71 | 48.94 | 46.61 | 37.36 | - | - |
BIP3D | RGB-D | DAT | 58.47 | 59.02 | 52.23 | 60.20 | 57.56 | 66.63 | 54.79 | 46.72 | - | - |
3. Results on Multi-view 3D Grounding Validation Dataset
Model | Inputs | Op | Mixed Data | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BIP3D | RGB | DAG | No | 45.81 | 46.21 | 41.34 | 47.07 | 45.09 | 50.40 | 47.53 | 32.97 | - | - |
BIP3D | RGB | DAT | No | 47.29 | 47.82 | 41.42 | 48.58 | 46.56 | 52.74 | 47.85 | 34.60 | - | - |
BIP3D | RGB-D | DAG | No | 53.75 | 53.87 | 52.43 | 55.21 | 52.93 | 60.05 | 54.92 | 38.20 | - | - |
BIP3D | RGB-D | DAT | No | 61.36 | 61.88 | 55.58 | 62.43 | 60.76 | 66.96 | 62.75 | 46.92 | - | - |
BIP3D | RGB-D | DAT | Yes | 66.58 | 66.99 | 62.07 | 67.95 | 65.81 | 72.43 | 68.26 | 51.14 | - | - |
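To make the ablation easier to read, the short script below computes the differences in the Overall column between pairs of rows in the grounding validation table. The numbers are copied from that table; the dictionary layout and the `gain` helper are illustrative only.

```python
# Overall scores from the grounding validation table,
# keyed by (inputs, fusion op, mixed-data training).
overall = {
    ("RGB", "DAG", False): 45.81,
    ("RGB", "DAT", False): 47.29,
    ("RGB-D", "DAG", False): 53.75,
    ("RGB-D", "DAT", False): 61.36,
    ("RGB-D", "DAT", True): 66.58,
}

def gain(new, old):
    """Difference in Overall score between two configurations."""
    return round(overall[new] - overall[old], 2)

# Effect of swapping DAG for DAT, with and without depth input.
dat_gain_rgb = gain(("RGB", "DAT", False), ("RGB", "DAG", False))
dat_gain_rgbd = gain(("RGB-D", "DAT", False), ("RGB-D", "DAG", False))
# Effect of mixed-data training on the RGB-D + DAT model.
mixed_gain = gain(("RGB-D", "DAT", True), ("RGB-D", "DAT", False))

if __name__ == "__main__":
    print(f"DAT vs DAG (RGB):   +{dat_gain_rgb}")
    print(f"DAT vs DAG (RGB-D): +{dat_gain_rgbd}")
    print(f"Mixed data (RGB-D): +{mixed_gain}")
```

On this split, the gain from DAT is much larger with RGB-D input than with RGB alone, and mixed-data training adds a further gain on top of the RGB-D DAT model.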
4. Results on Multi-view 3D Grounding Test Dataset
Model | Overall | Easy | Hard | View-dep | View-indep | ckpt | log |
---|---|---|---|---|---|---|---|
EmbodiedScan | 39.67 | 40.52 | 30.24 | 39.05 | 39.94 | - | - |
SAG3D* | 46.92 | 47.72 | 38.03 | 46.31 | 47.18 | - | - |
DenseG* | 59.59 | 60.39 | 50.81 | 60.50 | 59.20 | - | - |
BIP3D | 67.38 | 68.12 | 59.08 | 67.88 | 67.16 | - | - |
BIP3D-Base | 70.53 | 71.22 | 62.91 | 70.69 | 70.47 | - | - |
Citation
@article{lin2024bip3d,
title={BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence},
author={Lin, Xuewu and Lin, Tianwei and Huang, Lichao and Xie, Hongyu and Su, Zhizhong},
journal={arXiv preprint arXiv:2411.14869},
year={2024}
}