Update README.md
README.md
<!-- <div align="center">
<img src="assets/halva_icon.png" alt="HALVA" style="width:auto;height:144px;">
</div>
<h1 align="center">
HALVA
</h1> -->

<h1 align="center">
Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination
</h1>

<h3 align="center">
ICLR 2025
</h3>
<h3 align="center">
<a href="https://www.pritamsarkar.com">Pritam Sarkar</a>
Sayna Ebrahimi
Ali Etemad
<br>
Ahmad Beirami
Sercan O Arik
Tomas Pfister
</h3>

<h4 align="center">
<a href="https://arxiv.org/abs/2405.18654">[arXiv]</a>
<a href="https://openreview.net/forum?id=yG1fW8igzP">[OpenReview]</a>
<a href="https://github.com/pritamqu/HALVA/">[GitHub]</a>
<a href="https://huggingface.co/collections/pritamqu/halva-6797efacaa78d98bccb8e57a">[Model Weights 🤗]</a>
<a href="./?tab=readme-ov-file#data">[Training Data]</a>
</h4>

<hr>

Please see our [GitHub](https://github.com/pritamqu/HALVA/) repo for details.

### Setup environment

```
conda create -n halva python=3.10 -y
conda activate halva
pip install --upgrade pip
pip install -r req.txt
module load cuda/11.7.1
pip install flash-attn --no-build-isolation
```
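
To verify the environment before training or evaluation, a quick sanity check can help (a minimal sketch; `module load cuda/11.7.1` above is cluster-specific, so skip it if CUDA 11.7 is already on your PATH). It simply confirms that PyTorch sees a GPU and that flash-attn imports cleanly:

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn OK')"
```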

### Try HALVA!

We share a minimal setup to quickly try HALVA! See this [notebook](https://github.com/pritamqu/HALVA/blob/master/try_halva.ipynb).

### Model weights

- [HALVA 7B](https://huggingface.co/pritamqu/halva7b-lora)
- [HALVA 13B](https://huggingface.co/pritamqu/halva13b-lora)
- [HALVA 13B/384](https://huggingface.co/pritamqu/halva13b384-lora)

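These checkpoints are LoRA adapters, so the matching LLaVA-v1.5 base model must be supplied when loading them. As a rough sketch of a command-line check (assuming this repo keeps LLaVA's `llava.eval.run_llava` entry point; the image path and prompt are placeholders):

```
python -m llava.eval.run_llava \
    --model-path pritamqu/halva7b-lora \
    --model-base liuhaotian/llava-v1.5-7b \
    --image-file /path/to/an/image.jpg \
    --query "Describe this image in detail."
```

The notebook linked above remains the recommended starting point.
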
### Training HALVA

### Data

**Generative data-augmented contrastive samples**

- Vision-language instructions and their correct and hallucinated responses are available here: [data](https://github.com/pritamqu/HALVA/blob/master/data/data.json).
- Download the images from Visual Genome and save both part 1 and part 2 as `data/vg/VG_100K` and `data/vg/VG_100K_2` (a download sketch follows this list).

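For instance, the Visual Genome images could be fetched and arranged as follows (a sketch only; the download URLs are the commonly used Visual Genome links and are an assumption here, so verify them against the official Visual Genome site):

```
mkdir -p data/vg
# Visual Genome images, part 1 and part 2 (verify URLs before use)
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -O vg_images_part1.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -O vg_images_part2.zip
unzip vg_images_part1.zip -d data/vg/   # expected to yield data/vg/VG_100K
unzip vg_images_part2.zip -d data/vg/   # expected to yield data/vg/VG_100K_2
```
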
**Reference samples**

- A random subset of [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main). For reproducibility, we share the exact subset used in our study: [ref data](data/ref_data.json).
- Image sources (the expected layout is sketched below):
    - MSCOCO - download the images to `data/MSCOCO2017`
    - TextVQA - download the images to `data/textvqa`
    - GQA - download the images to `data/gqa`
    - OCR-VQA - download the images to `data/ocr_vqa`

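Putting both data sources together, the training scripts are expected to read a layout roughly like the one below (a sketch; the image downloads themselves follow the standard MSCOCO 2017, TextVQA, GQA, and OCR-VQA releases, as in LLaVA's visual instruction tuning setup):

```
mkdir -p data/vg/VG_100K data/vg/VG_100K_2 \
         data/MSCOCO2017 data/textvqa data/gqa data/ocr_vqa
# data/data.json      -> contrastive (correct vs. hallucinated) samples
# data/ref_data.json  -> reference samples
# image files (or their usual subfolders) go under the matching directory above;
# check the training scripts for the exact paths they expect
```
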
### Train

- The base model LLaVA-v1.5 weights can be found here: [7B](https://huggingface.co/liuhaotian/llava-v1.5-7b) and [13B](https://huggingface.co/liuhaotian/llava-v1.5-13b).
- We train on 4 A100 80GB GPUs, which takes 1.5 hours for the 7B variant and 3 hours for the 13B variant. If you are using different GPUs, please make sure to match our default batch_size x gradient accumulation steps so that the default hyperparameters remain optimal.
- The following training scripts can be used to train HALVA with LLaVA 1.5 as the base model (see the launch example below):
    - HALVA-7B: `src/hallava_7b.sh`
    - HALVA-13B: `src/hallava_13b.sh`

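A typical launch is just the script itself; the only caveat from the note above is keeping the effective batch size fixed when the GPU count changes (a sketch; how per-device batch size and gradient accumulation are exposed depends on the scripts):

```
# 7B variant (defaults assume 4x A100 80GB)
bash src/hallava_7b.sh

# 13B variant
bash src/hallava_13b.sh

# effective batch size = num_gpus x per_device_batch_size x gradient_accumulation_steps;
# e.g., going from 4 to 8 GPUs means halving one of the other two factors
```
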
### Evaluation on hallucination benchmarks

Choose the HALVA variant and its base model. We provide sample validation scripts for evaluation; **please make sure to update the paths based on your setup**.

```
MODEL="halva13b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-13b"

# OR

MODEL="halva7b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-7b"
```

#### CHAIR

- Download the validation images from [MSCOCO2014](https://cocodataset.org/#download) and store them as `data/MSCOCO2014/val2014`. We use the same 500 images for validation as used in [prior work](https://github.com/yuezih/less-is-more/blob/main/CHAIR-eval/data/chair-500.jsonl).
- You can use the given sample script for evaluation.

```
##### run chair
bash src/evaluate_hall/chair.sh ${MODEL} ${MODEL_BASE}
```

#### MME-Hall

- MME-Hall is the subset of MME consisting of the `existence`, `count`, `position`, and `color` categories.
- You can follow the official instructions for MME evaluation and download the MME benchmark from [link](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
- Once the data is downloaded, you can use the given sample script for evaluation.

```
##### run mme
bash src/evaluate_hall/mme.sh ${MODEL} ${MODEL_BASE}
```

#### AMBER

- Download the validation images from the source repo [AMBER](https://github.com/junyangwang0410/AMBER/tree/master) and keep them as `data/amber/image/`.
- Download the annotation [data](https://github.com/junyangwang0410/AMBER/tree/master/data) directory and save it as `eval_hall/amber/data`.
- Once the data is downloaded, you can use the given sample script for evaluation.

```
##### run amber evaluation on 4 GPUs in parallel if available, else run sequentially by removing & from the end
bash src/evaluate_hall/amber.sh g ${MODEL} ${MODEL_BASE} 0 &
bash src/evaluate_hall/amber.sh da ${MODEL} ${MODEL_BASE} 1 &
bash src/evaluate_hall/amber.sh dr ${MODEL} ${MODEL_BASE} 2 &
bash src/evaluate_hall/amber.sh de ${MODEL} ${MODEL_BASE} 3 &
wait
# get amber f1 for all discriminative tasks
bash src/evaluate_hall/amber_f1.sh ${MODEL}
```

#### MMHal-Bench

- The validation data will be downloaded directly from HuggingFace. You can use the given sample script for evaluation.

```
##### run mmhal-bench
bash src/evaluate_hall/mmhal.sh ${MODEL} ${MODEL_BASE} 0
```

#### HallusionBench

- Download the validation images from [link](https://drive.google.com/file/d/1eeO1i0G9BSZTE1yd5XeFwmrbe1hwyf_0/view?usp=sharing) and save them in `data/hallusion_bench`.
- Download the annotation files from [link](https://github.com/tianyi-lab/HallusionBench/blob/main/HallusionBench.json) and save them in `eval_hall/hallusion_bench`.
- For more details, you can check the [official repo](https://github.com/tianyi-lab/HallusionBench). You can use the given sample script for evaluation.

```
##### run hallusion-bench
bash src/evaluate_hall/hallusionbench.sh ${MODEL} ${MODEL_BASE} 0
```

### Evaluation on general vision-language tasks

In addition to the hallucination benchmarks above, we also evaluate on general vision-language benchmarks. For these, we directly follow the LLaVA repo:

- [VQA](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#vqav2)
- [MM-Vet](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#mm-vet)
- [TextVQA](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#textvqa)
- [MME](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#mme)

### VILA

The instructions above mainly apply to the LLaVA 1.5 based checkpoints; the VILA code can be found inside the `*_vila` directories.

### Citation

If you find this repository useful, please consider giving it a star :star: and citing it with the following BibTeX entry:

```
@misc{sarkar2024halva,
      title={Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination},
      author={Pritam Sarkar and Sayna Ebrahimi and Ali Etemad and Ahmad Beirami and Sercan Ö. Arık and Tomas Pfister},
      year={2024},
      eprint={2405.18654},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

### Acknowledgement

This codebase is built upon [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main) and [VILA](https://github.com/NVlabs/VILA).