wingrune and nielsr (HF Staff) committed (verified)
Commit 50e9f64 · Parent(s): ae08151

Enhance model card for 3DGraphLLM with metadata, abstract, performance, and usage (#1)


- Enhance model card for 3DGraphLLM with metadata, abstract, performance, and usage (29402d8bd9597fdbce2218ff42f1a12c0d66dbc6)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +48 -3
README.md CHANGED
@@ -1,18 +1,55 @@
---
license: mit
---
# 3DGraphLLM

- 3DGraphLLM is a model that uses a 3D scene graph and an LLM to perform 3D vision-language tasks.

<p align="center">
<img src="ga.png" width="80%">
</p>

## Model Details

- We provide our best checkpoint that uses [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as an LLM, [Mask3D](https://github.com/JonasSchult/Mask3D) 3D instance segmentation to get scene graph nodes, [VL-SAT](https://github.com/wz7in/CVPR2023-VLSAT) to encode semantic relations [Uni3D](https://github.com/baaivision/Uni3D) as 3D object encoder, and [DINOv2](https://github.com/facebookresearch/dinov2) as 2D object encoder.

## Citation
If you find 3DGraphLLM helpful, please consider citing our work as:
@@ -26,4 +63,12 @@ If you find 3DGraphLLM helpful, please consider citing our work as:
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.18450},
}
- ```

---
license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - 3d-scene-understanding
+ - scene-graph
+ - multimodal
+ - vlm
+ - llama
+ - vision-language-model
---
+
# 3DGraphLLM

+ 3DGraphLLM is a model that combines semantic graphs and large language models for 3D scene understanding. It aims to improve 3D vision-language tasks by explicitly incorporating semantic relationships into a learnable representation of a 3D scene graph, which is then used as input to LLMs.
+
+ This model was presented in the paper:
+ [**3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding**](https://huggingface.co/papers/2412.18450)
+
+ The official code is publicly available at: [https://github.com/CognitiveAISystems/3DGraphLLM](https://github.com/CognitiveAISystems/3DGraphLLM)
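
As a concrete illustration of the idea described above, the sketch below shows one way a 3D scene graph could be flattened into an ordered sequence of embeddings (each object followed by relation and neighbor embeddings for a few nearest neighbors) before being passed to an LLM. This is a minimal, hypothetical sketch for intuition only, not code from the 3DGraphLLM repository; the names (`SceneObject`, `Relation`, `flatten_scene_graph`) and the neighbor count `k` are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SceneObject:
    object_id: int            # instance id, e.g. from a 3D instance segmentation
    embedding: List[float]    # object token (fused 2D/3D features)


@dataclass
class Relation:
    subject_id: int
    neighbor_id: int
    embedding: List[float]    # semantic relation token (e.g. "standing on")


def flatten_scene_graph(objects: List[SceneObject],
                        relations: List[Relation],
                        k: int = 2) -> List[List[float]]:
    """Serialize the graph: for every object, emit its embedding, then
    (relation, neighbor) embeddings for up to k of its neighbors."""
    id2obj: Dict[int, SceneObject] = {o.object_id: o for o in objects}
    by_subject: Dict[int, List[Relation]] = {}
    for rel in relations:
        by_subject.setdefault(rel.subject_id, []).append(rel)

    sequence: List[List[float]] = []
    for obj in objects:
        sequence.append(obj.embedding)                          # object token
        for rel in by_subject.get(obj.object_id, [])[:k]:       # k nearest neighbors
            sequence.append(rel.embedding)                      # relation token
            sequence.append(id2obj[rel.neighbor_id].embedding)  # neighbor token
    return sequence
```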

<p align="center">
<img src="ga.png" width="80%">
</p>

+ ## Abstract
+ A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on the popular ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects.

## Model Details

+ We provide our best checkpoint that uses [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the LLM, [Mask3D](https://github.com/JonasSchult/Mask3D) 3D instance segmentation to obtain scene graph nodes, [VL-SAT](https://github.com/wz7in/CVPR2023-VLSAT) to encode semantic relations, [Uni3D](https://github.com/baaivision/Uni3D) as the 3D object encoder, and [DINOv2](https://github.com/facebookresearch/dinov2) as the 2D object encoder.
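
For intuition, here is a rough sketch of how per-object features from such encoders might be mapped into the LLM embedding space with learnable linear projections. The module name, feature dimensions, and fusion scheme below are assumptions for illustration and may differ from the actual projection layers used in the repository.

```python
import torch
import torch.nn as nn


class SceneGraphProjector(nn.Module):
    """Illustrative only: project 3D object features (Uni3D-style), 2D object
    features (DINOv2-style), and relation features (VL-SAT-style) into the
    LLM token-embedding space (4096-dim for Llama-3-8B)."""

    def __init__(self, dim_3d: int = 1024, dim_2d: int = 768,
                 dim_rel: int = 512, dim_llm: int = 4096):
        super().__init__()
        self.obj_proj = nn.Linear(dim_3d + dim_2d, dim_llm)  # fused object token
        self.rel_proj = nn.Linear(dim_rel, dim_llm)          # relation token

    def forward(self, feats_3d, feats_2d, feats_rel):
        # feats_3d: (num_objects, dim_3d), feats_2d: (num_objects, dim_2d)
        # feats_rel: (num_relations, dim_rel)
        obj_tokens = self.obj_proj(torch.cat([feats_3d, feats_2d], dim=-1))
        rel_tokens = self.rel_proj(feats_rel)
        return obj_tokens, rel_tokens  # interleaved downstream into the LLM prompt


# Toy example: 5 objects, 8 pairwise relations, random features
projector = SceneGraphProjector()
obj_tok, rel_tok = projector(torch.randn(5, 1024), torch.randn(5, 768), torch.randn(8, 512))
print(obj_tok.shape, rel_tok.shape)  # torch.Size([5, 4096]) torch.Size([8, 4096])
```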
+
+ ## Performance
+ Semantic relations boost LLM performance on the 3D referred object grounding and dense scene captioning tasks.
+
+ | | [ScanRefer](https://github.com/daveredrum/ScanRefer) | | [Multi3dRefer](https://github.com/3dlg-hcvc/M3DRef-CLIP) | | [Scan2Cap](https://github.com/daveredrum/Scan2Cap) | | [ScanQA](https://github.com/ATR-DBI/ScanQA) | | [SQA3D](https://github.com/SilongYong/SQA3D) |
+ |:----:|:---------:|:-------:|:------:|:------:|:---------:|:----------:|:------------:|:------:|:-----:|
+ | | Acc@0.25 | Acc@0.5 | F1@0.25 | F1@0.5 | CIDEr@0.5 | B-4@0.5 | CIDEr | B-4 | EM |
+ | [Chat-Scene](https://github.com/ZzZZCHS/Chat-Scene/tree/dev) | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | **87.7** | **14.3** | <ins>54.6</ins> |
+ | <ins>3DGraphLLM Vicuna-1.5</ins> | <ins>58.6</ins> | <ins>53.0</ins> | <ins>61.9</ins> | <ins>57.3</ins> | <ins>79.2</ins> | <ins>34.7</ins> | <ins>91.2</ins> | 13.7 | 55.1 |
+ | **3DGraphLLM LLAMA3-8B** | **62.4** | **56.6** | **64.7** | **59.9** | **81.0** | **36.5** | 88.8 | <ins>15.9</ins> | **55.9** |
+
+ ## Usage
+
+ For detailed instructions on environment preparation, downloading LLM backbones, data preprocessing, training, and inference, please refer to the [official GitHub repository](https://github.com/CognitiveAISystems/3DGraphLLM).
+
+ You can run the interactive demo by following the instructions in the GitHub repository, or try the simplified command below:
+ ```bash
+ bash demo/run_demo.sh
+ ```
+ This lets you ask different queries about Scene 435 of ScanNet.
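
If you prefer to fetch the checkpoint files from the Hub programmatically before following the GitHub instructions, a standard `huggingface_hub` download works. The repository id below is a placeholder, not a confirmed id; replace it with the id shown at the top of this model card.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the actual id of this model repository.
repo_id = "<user-or-org>/3DGraphLLM"
local_dir = snapshot_download(repo_id=repo_id)
print(f"Checkpoint files downloaded to: {local_dir}")
```

The demo and training scripts from the GitHub repository can then be pointed at the downloaded files, following the instructions there.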

## Citation
If you find 3DGraphLLM helpful, please consider citing our work as:

primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.18450},
}
+ ```
+
+ ## Acknowledgement
+ Thanks to the following open-source projects:
+
+ [Chat-Scene](https://github.com/ZzZZCHS/Chat-Scene/tree/dev)
+
+ ## Contact
+ If you have any questions about the project, please open an issue in the [GitHub repository](https://github.com/CognitiveAISystems/3DGraphLLM) or send an email to [Tatiana Zemskova]([email protected]).