---
license: apache-2.0
datasets:
- qizekun/ShapeLLM
language:
- en
---

## ShapeLLM model

This repository contains the ShapeLLM-13B model presented in [ShapeLLM: Universal 3D Object Understanding for Embodied Interaction](https://huggingface.co/papers/2402.17766).

## Install

[//]: # (If you are using Windows, do *NOT* proceed, see instructions [here](https://github.com/qizekun/LLaVA/blob/main/docs/Windows.md).)

1. Clone this repository and navigate to the ShapeLLM folder
```Shell
git clone https://github.com/qizekun/ShapeLLM.git
cd ShapeLLM
```
2. Install the package
```Shell
conda create -n shapellm python=3.10 -y
conda activate shapellm
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
3. Install additional packages for training
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
4. Install PointNet++
```Shell
pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
```


## ShapeLLM
### Model Weights
Please check out our [Model Zoo](https://github.com/qizekun/ShapeLLM/blob/main/docs/MODEL_ZOO.md) for all public ShapeLLM checkpoints.
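
The checkpoints are hosted on the Hugging Face Hub and can also be pulled locally ahead of time. The snippet below is a minimal sketch using `huggingface-cli`; the target directory is an arbitrary choice, and at inference time you can equally pass the repo id directly as `--model-path`.
```Shell
# Optional: pre-download the ShapeLLM-13B checkpoint from the Hugging Face Hub.
# Requires: pip install -U "huggingface_hub[cli]"
huggingface-cli download qizekun/ShapeLLM_13B_general_v1.0 \
    --local-dir ./checkpoints/ShapeLLM_13B_general_v1.0
```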

### Demo
#### CLI Inference
Chat about point clouds using the CLI interface. It also supports multi-GPU, 4-bit, and 8-bit quantized inference.
```Shell
python -m llava.serve.cli \
    --model-path qizekun/ShapeLLM_13B_general_v1.0 \
    --pts-file assets/instrument.npy
```
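
The command above runs full-precision inference. For the 4-bit/8-bit modes mentioned earlier, ShapeLLM's CLI is derived from LLaVA, so the quantization flag below is an assumption carried over from upstream LLaVA; verify the exact flag names against `llava/serve/cli.py`.
```Shell
# 4-bit quantized inference (assumed LLaVA-style flag; use --load-8bit for 8-bit).
python -m llava.serve.cli \
    --model-path qizekun/ShapeLLM_13B_general_v1.0 \
    --pts-file assets/instrument.npy \
    --load-4bit
```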

### Training
Consistent with LLaVA, we adopt a two-stage training approach. In the first stage, we fine-tune only the projector for semantic alignment. In the second stage, we perform full fine-tuning on instruction-following data.
Download the data following [DATA](https://github.com/qizekun/ShapeLLM/blob/main/docs/DATA.md) and organize it as follows under `./playground/data/shapellm/`:
```
β”‚playground/data/shapellm/
β”œβ”€β”€ cap3d_objaverse_785k.json
β”œβ”€β”€ cap3d_objaverse_sft_45k.json
β”œβ”€β”€ gapartnet_sft_27k_openai.json
β”œβ”€β”€ gapartnet_pcs
β”‚   β”œβ”€β”€ Box_100129_0_0.npy
β”‚   └── ...
└── cap3d_pcs
    β”œβ”€β”€ 00000054c36d44a2a483bdbff31d8edf.pt
    └── ...
```
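If the annotation files and point clouds above are mirrored in the `qizekun/ShapeLLM` dataset repository listed in this card's metadata, a bulk download along the following lines may work; treat [DATA](https://github.com/qizekun/ShapeLLM/blob/main/docs/DATA.md) as the authoritative instructions.
```Shell
# Hypothetical bulk download, assuming the files are hosted in the
# qizekun/ShapeLLM dataset repo; follow docs/DATA.md if the layout differs.
huggingface-cli download qizekun/ShapeLLM --repo-type dataset \
    --local-dir ./playground/data/shapellm
```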
Furthermore, ShapeLLM uses the Large version of [ReCon++](https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/cfgs/pretrain/large/openshape.yaml) as its point encoder.
You need to download the [ReCon++ weight](https://huggingface.co/qizekun/ReConV2/blob/main/zeroshot/large/best_lvis.pth) and save it to `./checkpoints/recon/large.pth`.
```
β”‚checkpoints/recon/
└── large.pth
```
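One way to fetch the weight is sketched below; the `resolve/` URL is the standard Hugging Face download form of the `blob/` link above.
```Shell
# Download the ReCon++ (Large) point encoder weight to the expected path.
mkdir -p checkpoints/recon
wget -O checkpoints/recon/large.pth \
    https://huggingface.co/qizekun/ReConV2/resolve/main/zeroshot/large/best_lvis.pth
```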
**1. Feature Alignment Stage**
```
sh scripts/pretrain.sh
```
**2. Visual Instruction Tuning Stage**
```
sh scripts/finetune.sh
```
Training takes around 14 hours for ShapeLLM-13B on 8x A100 (80 GB) GPUs, and around 7 hours for ShapeLLM-7B.

### Zero-shot Understanding on 3D MM-Vet
To evaluate 3D MLLMs on integrated capabilities and embodied interaction, run the script:
```
sh scripts/eval/mmvet.sh
```
Use GPT-4 to calculate the 3D MM-Vet score:
```
sh scripts/eval/eval_mmvet.sh
```

### Visual Grounding on GApartNet
To evaluate the performance of ShapeLLM on the GApartNet dataset, run the script:
```
sh scripts/eval/gapartnet_ref.sh
```
Calculate the generative 3D visual grounding accuracy:
```
sh scripts/eval/eval_gapartnet.sh
```