pinned: false
---

(Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference)

[**Online Demo**](http://103.170.5.190:7860/)
[**OpenXLab App**](https://openxlab.org.cn/apps/detail/openxlab-app/LISA)

# LISA: Reasoning Segmentation via Large Language Model

<font size=7><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>

<font size=7><div align='center'> <a href="https://arxiv.org/pdf/2308.00692.pdf">**Paper**</a> | <a href="https://huggingface.co/xinlai">**Models**</a> | [**Training**](#training) | [**Inference**](#inference) | [**Local Deployment**](#deployment) | [**Dataset**](#dataset) | <a href="http://103.170.5.190:7860/">**Online Demo**</a></div></font>

<!-- <p align="center"> <img src="resources/imgs/teaser.jpg" width="100%"> </p> -->

<table class="center">
  <tr>
    <td style="text-align:center;"><b>Input</b></td>
    <td style="text-align:center;"><b>Output</b></td>
    <td style="text-align:center;"><b>Input</b></td>
    <td style="text-align:center;"><b>Output</b></td>
  </tr>
  <tr>
    <td><img src="resources/imgs/obama.jpg"></td>
    <td><img src="vis_output/obama.jpg"></td>
    <td><img src="resources/imgs/trump.jpg"></td>
    <td><img src="vis_output/trump.jpg"></td>
  </tr>
  <tr>
    <td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain the reason."</td>
    <td width=25% style="text-align:center;">"Sure, the segmentation result is [SEG]. The President of the United States in the image is <ins>President Obama</ins>."</td>
    <td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain why."</td>
    <td width=25% style="text-align:center;">"Sure, [SEG]. In the image, the President of the United States is <ins>President Trump</ins>."</td>
  </tr>

  <tr>
    <td><img src="resources/imgs/stand_higher.jpg"></td>
    <td><img src="vis_output/stand_higher.jpg"></td>
    <td><img src="resources/imgs/camera_lens.jpg"></td>
    <td><img src="vis_output/camera_lens.jpg"></td>
  </tr>
  <tr>
    <td width=25% style="text-align:center;color:gray;">"What can <ins>make the woman stand higher</ins>? Please output segmentation mask and explain why."</td>
    <td width=25% style="text-align:center;">"Sure, [SEG]. The woman is <ins>standing higher by using a ladder</ins>..."</td>
    <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the camera lens that is more suitable for photographing nearby objects</ins> in this image?"</td>
    <td width=25% style="text-align:center;">"Sure, [SEG]."</td>
  </tr>

  <tr>
    <td><img src="resources/imgs/dog_with_horn.jpg"></td>
    <td><img src="vis_output/dog_with_horn.jpg"></td>
    <td><img src="resources/imgs/wash_hands.jpg"></td>
    <td><img src="vis_output/wash_hands.jpg"></td>
  </tr>
  <tr>
    <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the unusual part</ins> in this image and explain why."</td>
    <td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the unusual part is <ins>the dog wearing a reindeer antler headband</ins>..."</td>
    <td width=25% style="text-align:center;color:gray;">"Where to <ins>wash hands</ins> in this image? Please output segmentation mask."</td>
    <td width=25% style="text-align:center;">"Sure, [SEG]."</td>
  </tr>

  <tr>
    <td><img src="resources/imgs/jackma.jpg"></td>
    <td><img src="vis_output/jackma.jpg"></td>
    <td><img src="resources/imgs/blackpink.jpg"></td>
    <td><img src="vis_output/blackpink.jpg"></td>
  </tr>
  <tr>
    <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the founder of Alibaba</ins> in this image and explain why?"</td>
    <td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is <ins>Jack Ma</ins>, the co-founder of Alibaba Group..."</td>
    <td width=25% style="text-align:center;color:gray;">"Please segment <ins>Lisa</ins> in this figure."</td>
    <td width=25% style="text-align:center;">"Sure, [SEG]."</td>
  </tr>
</table>

<p align="center"> <img src="resources/imgs/fig_overview.jpg" width="100%"> </p>
|
| 84 |
+
|
| 85 |
+
## News
|
| 86 |
+
- [x] [2023.8.30] Release three new models [LISA-7B-v1](https://huggingface.co/xinlai/LISA-7B-v1), [LISA-7B-v1-explanatory](https://huggingface.co/xinlai/LISA-7B-v1-explanatory), and [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory). Welcome to check them out!
|
| 87 |
+
- [x] [2023.8.23] Refactor code, and release new model [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1). Welcome to check it out!
|
| 88 |
+
- [x] [2023.8.9] Training code is released!
|
| 89 |
+
- [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
|
| 90 |
+
- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released!
|
| 91 |
+
- [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check them out!
|
| 92 |
+
- [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and GitHub repo is created.
|
| 93 |
+
|
| 94 |
+
**LISA: Reasoning Segmentation via Large Language Model [[Paper](https://arxiv.org/abs/2308.00692)]** <br />
[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl=zh-CN),
[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ&hl=en),
[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ&hl=en),
[Yanwei Li](https://scholar.google.com/citations?user=I-UCPPcAAAAJ&hl=zh-CN),
[Yuhui Yuan](https://scholar.google.com/citations?user=PzyvzksAAAAJ&hl=en),
[Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ&hl=zh-CN),
[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)<br />

## Abstract
In this work, we propose a new segmentation task --- ***reasoning segmentation***. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multi-modal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks.
For more details, please refer to the [paper](https://arxiv.org/abs/2308.00692).

## Highlights
**LISA** unlocks new segmentation capabilities of multi-modal LLMs and can handle cases involving:
1. complex reasoning;
2. world knowledge;
3. explanatory answers;
4. multi-turn conversation.

**LISA** also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs yields further performance gains.

## Experimental results
<p align="center"> <img src="resources/imgs/table1.jpg" width="80%"> </p>

## Installation
```
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

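A quick way to confirm the environment is usable is the optional check below (a minimal sketch, not part of the repo):

```
# Optional environment check: confirms torch sees a GPU and flash-attn imports correctly.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed; install it with `pip install flash-attn --no-build-isolation`.")
```
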
## Training
### Training Data Preparation
The training data consists of 4 types of data:

1. Semantic segmentation datasets: [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip), [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip), [Mapillary](https://www.mapillary.com/dataset/vistas), [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup), [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part), [COCO Images](http://images.cocodataset.org/zips/train2017.zip)

   Note: for COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part of PACO. COCO Images should be put into the `dataset/coco/` directory.

2. Referring segmentation datasets: [refCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip), [refCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip), [refCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip), [refCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip) ([saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip))

   Note: the original links for the refCOCO-series data are down, so we have updated them with new ones. If the download speed is very slow or unstable, we also provide a [OneDrive link](https://mycuhk-my.sharepoint.com/:f:/g/personal/1155154502_link_cuhk_edu_hk/Em5yELVBvfREodKC94nOFLoBLro_LPxsOxNV44PHRWgLcA?e=zQPjsc). **You must also follow the rules that the original datasets require.**

3. Visual Question Answering dataset: [LLaVA-Instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)

4. Reasoning segmentation dataset: [ReasonSeg](https://github.com/dvlab-research/LISA#dataset)

Download them from the links above and organize them as follows.

```
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
```

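Before launching training, it can help to confirm that the layout above is in place. The snippet below is a minimal, hypothetical sanity check (the folder list simply mirrors the tree above; it is not part of the official repo):

```
# Hypothetical layout check: verify the expected top-level dataset folders exist under ./dataset.
from pathlib import Path

EXPECTED = [
    "ade20k", "coco/train2017", "cocostuff/train2017", "llava_dataset",
    "mapillary", "reason_seg/ReasonSeg", "refer_seg/images", "vlpart",
]

def check_layout(root: str = "./dataset") -> None:
    root_path = Path(root)
    for rel in EXPECTED:
        status = "ok" if (root_path / rel).is_dir() else "MISSING"
        print(f"{status:7s} {root_path / rel}")

if __name__ == "__main__":
    check_layout()
```
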
### Pre-trained weights

#### LLaVA
To train LISA-7B or 13B, you need to follow the [instructions](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to merge the LLaVA delta weights. Typically, we use the final weights `LLaVA-Lightning-7B-v1-1` and `LLaVA-13B-v1-1`, merged from `liuhaotian/LLaVA-Lightning-7B-delta-v1-1` and `liuhaotian/LLaVA-13b-delta-v1-1`, respectively. For Llama2, we can directly use the full LLaVA weights `liuhaotian/llava-llama-2-13b-chat-lightning-preview`.

#### SAM ViT-H weights
Download the SAM ViT-H pre-trained weights from this [link](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth).

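If you prefer scripting the download, a plain Python sketch is below (wget or curl work just as well):

```
# Download the SAM ViT-H checkpoint referenced above (a multi-gigabyte file).
import urllib.request

SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(SAM_URL, "sam_vit_h_4b8939.pth")
print("saved sam_vit_h_4b8939.pth")
```
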
### Training
```
deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LLaVA" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="lisa-7b"
```

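The `--dataset` and `--sample_rates` flags pair up element-wise. As an illustration only (assumed behavior; see `train_ds.py` for the exact sampling logic), comma-separated rates such as `9,3,3,1` act as relative weights for choosing which data source each training example is drawn from:

```
# Illustrative only: how relative sample rates translate into per-dataset sampling probabilities.
import random

datasets = "sem_seg||refer_seg||vqa||reason_seg".split("||")
rates = [float(r) for r in "9,3,3,1".split(",")]
probs = [r / sum(rates) for r in rates]

for name, p in zip(datasets, probs):
    print(f"{name:10s} ~{p:.3f}")  # sem_seg ~0.563, refer_seg ~0.188, vqa ~0.188, reason_seg ~0.063

# Drawing the source of one training sample:
print("next sample from:", random.choices(datasets, weights=rates, k=1)[0])
```
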
When training is finished, get the full model weights with:
```
cd ./runs/lisa-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
```

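Optionally, you can inspect the consolidated checkpoint before merging (a quick sketch, assuming the paths used above):

```
# Quick sanity check of the fp32 checkpoint produced by zero_to_fp32.py.
import torch

state_dict = torch.load("./runs/lisa-7b/pytorch_model.bin", map_location="cpu")
print(f"{len(state_dict)} tensors in the checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```
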
### Merge LoRA Weight
Merge the LoRA weights in `pytorch_model.bin` and save the resulting model to your desired path in the Hugging Face format:
```
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="PATH_TO_LLaVA" \
  --weight="PATH_TO_pytorch_model.bin" \
  --save_path="PATH_TO_SAVED_MODEL"
```

For example:
```
CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
  --version="./LLaVA/LLaVA-Lightning-7B-v1-1" \
  --weight="lisa-7b/pytorch_model.bin" \
  --save_path="./LISA-7B"
```

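After merging, the save path should look like a standard Hugging Face model directory. A quick, illustrative listing (exact file names, e.g. `*.bin` shards vs. `*.safetensors`, depend on your `transformers` version):

```
# Illustrative: list the merged output directory and file sizes.
from pathlib import Path

save_path = Path("./LISA-7B")  # the --save_path used above
for f in sorted(save_path.iterdir()):
    print(f"{f.name:45s} {f.stat().st_size / 1e6:10.1f} MB")
# A usable checkpoint should contain at least config.json, tokenizer files, and model weight shards.
```
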
### Validation
```
deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LISA_HF_Model_Directory" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --exp_name="lisa-7b" \
  --eval_only
```

Note: the `v1` models are trained on both the `train` and `val` sets, so please use a `v0` model to reproduce the validation results. (To use the `v0` models, first check out the legacy version of the repo with `git checkout 0e26916`.)

## Inference

To chat with [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1) or [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory):
(Note that `chat.py` currently does not support the `v0` models, i.e., `LISA-13B-llama2-v0` and `LISA-13B-llama2-v0-explanatory`; to use them, first check out the legacy version of the repo with `git checkout 0e26916`.)
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1-explanatory'
```
To use the `bf16` or `fp16` data type for inference:
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='bf16'
```
To use `8bit` or `4bit` quantization for inference (this enables running the 13B model on a single 24 GB or 12 GB GPU at some cost in generation quality):
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_8bit
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_4bit
```
Hint: for the 13B model, 16-bit inference consumes about 30 GB of VRAM on a single GPU, 8-bit inference about 16 GB, and 4-bit inference about 9 GB.

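For reference, here is a generic sketch of how such precision and quantization options are typically mapped onto `transformers`/`bitsandbytes` loading arguments (an illustration only, not the actual logic inside `chat.py`):

```
# Generic illustration: mapping precision flags onto transformers/bitsandbytes loading kwargs.
import torch
from transformers import BitsAndBytesConfig

def loading_kwargs(precision="bf16", load_in_8bit=False, load_in_4bit=False):
    if load_in_4bit:
        return {"quantization_config": BitsAndBytesConfig(load_in_4bit=True,
                                                          bnb_4bit_compute_dtype=torch.float16)}
    if load_in_8bit:
        return {"quantization_config": BitsAndBytesConfig(load_in_8bit=True)}
    dtype = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}[precision]
    return {"torch_dtype": dtype}

print(loading_kwargs("bf16"))
print(loading_kwargs(load_in_4bit=True))
```
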
After that, input the text prompt and then the image path. For example:
```
- Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- Please input the image path: imgs/example1.jpg

- Please input your prompt: Can you segment the food that tastes spicy and hot?
- Please input the image path: imgs/example2.jpg
```
The results should look like this:
<p align="center"> <img src="resources/imgs/example1.jpg" width="22%"> <img src="vis_output/example1_masked_img_0.jpg" width="22%"> <img src="resources/imgs/example2.jpg" width="25%"> <img src="vis_output/example2_masked_img_0.jpg" width="25%"> </p>

## Deployment
```
CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1' --load_in_4bit
CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1-explanatory' --load_in_4bit
```
By default, we use 4-bit quantization. Feel free to delete the `--load_in_4bit` argument for 16-bit inference, or replace it with `--load_in_8bit` for 8-bit inference.

## Dataset
In ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from <a href="https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing">**this link**</a>.

Each image is provided with an annotation JSON file:
```
image_1.jpg, image_1.json
image_2.jpg, image_2.json
...
image_n.jpg, image_n.json
```

Important keys contained in the JSON files:
```
- "text": text instructions.
- "is_sentence": whether the text instructions are long sentences.
- "shapes": target polygons.
```

The elements of "shapes" fall into two categories, **"target"** and **"ignore"**. The former is required for evaluation, while the latter denotes an ambiguous region and is therefore disregarded during evaluation.

We provide a <a href="https://github.com/dvlab-research/LISA/blob/main/utils/data_processing.py">**script**</a> that demonstrates how to process the annotations:
```
python3 utils/data_processing.py
```

Besides, we leveraged GPT-3.5 to rephrase the instructions, so each image in the training set may have **more than one instruction (but fewer than six)** in the "text" field. During training, users may randomly select one as the text query to obtain a better model; see the sketch below.

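As a rough orientation, a sketch of reading one annotation and sampling a text query follows. Only `"text"`, `"is_sentence"`, and `"shapes"` are documented above; the per-shape `"label"` and `"points"` keys are assumptions, and `utils/data_processing.py` remains the authoritative reference for parsing the polygons.

```
# Hypothetical sketch: load one ReasonSeg annotation and sample a text query for training.
import json
import random

with open("dataset/reason_seg/ReasonSeg/train/image_1.json") as f:
    ann = json.load(f)

instructions = ann["text"] if isinstance(ann["text"], list) else [ann["text"]]
query = random.choice(instructions)  # one of the (up to five) rephrased instructions
print("query:", query)
print("is_sentence:", ann.get("is_sentence"))

for shape in ann.get("shapes", []):
    # Assumed: each shape carries a label marking it as "target" or "ignore".
    print("shape label:", shape.get("label"), "| points:", len(shape.get("points", [])))
```
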
## Citation
If you find this project useful in your research, please consider citing:

```
@article{lai2023lisa,
  title={LISA: Reasoning Segmentation via Large Language Model},
  author={Lai, Xin and Tian, Zhuotao and Chen, Yukang and Li, Yanwei and Yuan, Yuhui and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2308.00692},
  year={2023}
}
@article{yang2023improved,
  title={An Improved Baseline for Reasoning Segmentation with Large Language Model},
  author={Yang, Senqiao and Qu, Tianyuan and Lai, Xin and Tian, Zhuotao and Peng, Bohao and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2312.17240},
  year={2023}
}
```

## Acknowledgement
- This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA) and [SAM](https://github.com/facebookresearch/segment-anything).