",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+}
diff --git a/MiniGPT4_Train.md b/MiniGPT4_Train.md
new file mode 100644
index 0000000000000000000000000000000000000000..f9e8a5c23a6f9fcf6dd1129dcc12e5b8ad6721b6
--- /dev/null
+++ b/MiniGPT4_Train.md
@@ -0,0 +1,41 @@
+## Training of MiniGPT-4
+
+The training of MiniGPT-4 contains two alignment stages.
+
+**1. First pretraining stage**
+
+In the first pretraining stage, the model is trained on image-text pairs from the Laion and CC datasets
+to align the visual features with the language model. To download and prepare the datasets, please check
+our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
+After the first stage, the visual features are mapped so that they can be understood by the language
+model.
+To launch the first stage training, run the following command. In our experiments, we use 4 A100 GPUs.
+You can change the save path in the config file
+[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml)
+
+```bash
+torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
+```
+
+A MiniGPT-4 checkpoint with only stage one training can be downloaded
+[here (13B)](https://drive.google.com/file/d/1u9FRRBB3VovP1HxCAlpD9Lw4t4P6-Yq8/view?usp=share_link) or [here (7B)](https://drive.google.com/file/d/1HihQtCEXUyBM1i9DQbaK934wW3TZi-h5/view?usp=share_link).
+Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.
+
+
+**2. Second finetuning stage**
+
+In the second stage, we use a small, high-quality image-text pair dataset curated by ourselves
+and convert it into a conversation format to further align MiniGPT-4.
+To download and prepare our second stage dataset, please check our
+[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
+To launch the second stage alignment,
+first specify the path to the checkpoint file trained in stage 1 in
+[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
+You can also specify the output path there.
+Then, run the following command. In our experiments, we use 1 A100 GPU.
+
+```bash
+torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
+```
+
+After the second stage alignment, MiniGPT-4 is able to talk about images coherently and in a user-friendly way.
diff --git a/MiniGPTv2.pdf b/MiniGPTv2.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..4218fb33fe33cbcda16e27156bbecdeda4c60aa8
--- /dev/null
+++ b/MiniGPTv2.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:429b0f5e3d70828fd691ef4ffb90c6efa094a8454bf03f8ec00b10fcd443f346
+size 4357853
diff --git a/MiniGPTv2_Train.md b/MiniGPTv2_Train.md
new file mode 100644
index 0000000000000000000000000000000000000000..378a7ad0e788f4ed84466a673c06561224518763
--- /dev/null
+++ b/MiniGPTv2_Train.md
@@ -0,0 +1,24 @@
+## Finetuning of MiniGPT-v2
+
+
+You first need to prepare the dataset. Please follow
+our [dataset preparation instruction](dataset/README_MINIGPTv2_FINETUNE.md).
+
+In train_configs/minigptv2_finetune.yaml, you need to set up the following paths:
+
+* llama_model checkpoint path: "/path/to/llama_checkpoint"
+* ckpt: "/path/to/pretrained_checkpoint"
+* ckpt save path: "/path/to/save_checkpoint"
+
+For ckpt, you may load from our pretrained model checkpoints:
+| MiniGPT-v2 (after stage-2) | MiniGPT-v2 (after stage-3) | MiniGPT-v2 (online developing demo) |
+|------------------------------|------------------------------|------------------------------|
+| [Download](https://drive.google.com/file/d/1Vi_E7ZtZXRAQcyz4f8E6LtLh2UXABCmu/view?usp=sharing) |[Download](https://drive.google.com/file/d/1HkoUUrjzFGn33cSiUkI-KcT-zysCynAz/view?usp=sharing) | [Download](https://drive.google.com/file/d/1aVbfW7nkCSYx99_vCRyP1sOlQiWVSnAl/view?usp=sharing) |
+
+Then, launch the finetuning with the following command:
+
+```bash
+torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigptv2_finetune.yaml
+```
+
diff --git a/README.md b/README.md
index f724e81cd2bf7641a90816a087fb12ece23af101..818adb361784bb8c74951e4d1108aa27dc999539 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,212 @@
---
-title: MiniGPT-4 Vicuna Version
-emoji: 🌍
-colorFrom: green
-colorTo: yellow
+title: MiniGPT-4_Vicuna_version
+app_file: demo_v2.py
sdk: gradio
-sdk_version: 4.44.1
-app_file: app.py
-pinned: false
+sdk_version: 3.47.1
---
+# MiniGPT-V
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+**MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning**
+
+Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong☨, Mohamed Elhoseiny☨
+
+☨equal last author
+
+
+[Video](https://www.youtube.com/watch?v=atFCwV2hSY4)
+
+
+ **MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models**
+
+Deyao Zhu*, Jun Chen*, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
+
+*equal contribution
+
+
+[Colab Demo](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing) [Video](https://www.youtube.com/watch?v=__tftoxpBAw&feature=youtu.be)
+
+*King Abdullah University of Science and Technology*
+
+## 💡 Get help - [Q&A](https://github.com/Vision-CAIR/MiniGPT-4/discussions/categories/q-a) or [Discord 💬](https://discord.gg/5WdJkjbAeE)
+
+**Example Community Efforts Built on Top of MiniGPT-4**
+
+* **InstructionGPT-4**: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4, Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun, arXiv, 2023
+
+* **PatFig**: Generating Short and Long Captions for Patent Figures, Dana Aubakirova, Kim Gerdes, and Lufei Liu, ICCVW, 2023
+
+* **SkinGPT-4**: An Interactive Dermatology Diagnostic System with Visual Large Language Model, Juexiao Zhou, Xiaonan He, Liyuan Sun, Jiannan Xu, Xiuying Chen, Yuetan Chu, Longxi Zhou, Xingyu Liao, Bin Zhang, and Xin Gao, arXiv, 2023
+
+* **ArtGPT-4**: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4, Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, and Kun Wang, arXiv, 2023
+
+
+
+
+## News
+[Oct.31 2023] We release the evaluation code of our MiniGPT-v2.
+
+[Oct.24 2023] We release the finetuning code of our MiniGPT-v2.
+
+[Oct.13 2023] Breaking! We release the first major update with our MiniGPT-v2.
+
+[Aug.28 2023] We now provide a Llama 2 version of MiniGPT-4.
+
+## Online Demo
+
+Click the link to chat with MiniGPT-v2 about your images:
+[MiniGPT-v2 online demo](https://minigpt-v2.github.io/)
+
+Click the link to chat with MiniGPT-4 about your images:
+[MiniGPT-4 online demo](https://minigpt-4.github.io)
+
+
+## MiniGPT-v2 Examples
+
+
+
+
+
+## MiniGPT-4 Examples
+
+More examples can be found in the [project page](https://minigpt-4.github.io).
+
+
+
+## Getting Started
+### Installation
+
+**1. Prepare the code and the environment**
+
+Git clone our repository, create a Python environment, and activate it via the following commands
+
+```bash
+git clone https://github.com/Vision-CAIR/MiniGPT-4.git
+cd MiniGPT-4
+conda env create -f environment.yml
+conda activate minigptv
+```
+
+
+**2. Prepare the pretrained LLM weights**
+
+**MiniGPT-v2** is based on Llama 2 Chat 7B. For **MiniGPT-4**, we have both Vicuna V0 and Llama 2 versions.
+Download the corresponding LLM weights from the following Hugging Face repositories by cloning them with git-lfs.
+
+| Llama 2 Chat 7B | Vicuna V0 13B | Vicuna V0 7B |
+:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
+[Download](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) | [Download](https://huggingface.co/Vision-CAIR/vicuna/tree/main) | [Download](https://huggingface.co/Vision-CAIR/vicuna-7b/tree/main)
+
+
+Then, set the variable *llama_model* in the model config file to the LLM weight path.
+
+* For MiniGPT-v2, set the LLM path
+[here](minigpt4/configs/models/minigpt_v2.yaml#L15) at Line 14.
+
+* For MiniGPT-4 (Llama2), set the LLM path
+[here](minigpt4/configs/models/minigpt4_llama2.yaml#L15) at Line 15.
+
+* For MiniGPT-4 (Vicuna), set the LLM path
+[here](minigpt4/configs/models/minigpt4_vicuna0.yaml#L18) at Line 18.
+
+**3. Prepare the pretrained model checkpoints**
+
+Download the pretrained model checkpoints
+
+
+| MiniGPT-v2 (after stage-2) | MiniGPT-v2 (after stage-3) | MiniGPT-v2 (online developing demo)|
+|------------------------------|------------------------------|------------------------------|
+| [Download](https://drive.google.com/file/d/1Vi_E7ZtZXRAQcyz4f8E6LtLh2UXABCmu/view?usp=sharing) |[Download](https://drive.google.com/file/d/1HkoUUrjzFGn33cSiUkI-KcT-zysCynAz/view?usp=sharing) | [Download](https://drive.google.com/file/d/1aVbfW7nkCSYx99_vCRyP1sOlQiWVSnAl/view?usp=sharing) |
+
+
+For **MiniGPT-v2**, set the path to the pretrained checkpoint in the evaluation config file
+in [eval_configs/minigptv2_eval.yaml](eval_configs/minigptv2_eval.yaml#L10) at Line 8.
+
+
+
+| MiniGPT-4 (Vicuna 13B) | MiniGPT-4 (Vicuna 7B) | MiniGPT-4 (LLaMA-2 Chat 7B) |
+|----------------------------|---------------------------|---------------------------------|
+| [Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) | [Download](https://drive.google.com/file/d/11nAPjEok8eAGGEG1N2vXo3kBLCg0WgUk/view?usp=sharing) |
+
+For **MiniGPT-4**, set the path to the pretrained checkpoint in the evaluation config file
+in [eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 8 for the Vicuna version, or in [eval_configs/minigpt4_llama2_eval.yaml](eval_configs/minigpt4_llama2_eval.yaml#L10) for the Llama 2 version.
+
+
+
+### Launching Demo Locally
+
+For MiniGPT-v2, run
+```
+python demo_v2.py --cfg-path eval_configs/minigptv2_eval.yaml --gpu-id 0
+```
+
+For MiniGPT-4 (Vicuna version), run
+
+```
+python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
+```
+
+For MiniGPT-4 (Llama2 version), run
+
+```
+python demo.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0
+```
+
+
+To save GPU memory, the LLM is loaded in 8-bit by default, with a beam search width of 1.
+This configuration requires about 23G of GPU memory for the 13B LLM and 11.5G for the 7B LLM.
+For more powerful GPUs, you can run the model
+in 16 bit by setting `low_resource` to `False` in the relevant config file:
+
+* MiniGPT-v2: [minigptv2_eval.yaml](eval_configs/minigptv2_eval.yaml#6)
+* MiniGPT-4 (Llama2): [minigpt4_llama2_eval.yaml](eval_configs/minigpt4_llama2_eval.yaml#6)
+* MiniGPT-4 (Vicuna): [minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#6)
+
+Thanks to [@WangRongsheng](https://github.com/WangRongsheng), you can also run MiniGPT-4 on [Colab](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing).
+
+
+### Training
+For training details of MiniGPT-4, check [here](MiniGPT4_Train.md).
+
+For finetuning details of MiniGPT-v2, check [here](MiniGPTv2_Train.md).
+
+
+### Evaluation
+For evaluation details of MiniGPT-v2, check [here](eval_scripts/EVAL_README.md).
+
+
+## Acknowledgement
+
++ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) The model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you don't know it already!
++ [Lavis](https://github.com/salesforce/LAVIS) This repository is built upon Lavis!
++ [Vicuna](https://github.com/lm-sys/FastChat) The fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!
++ [LLaMA](https://github.com/facebookresearch/llama) The strong open-sourced LLaMA 2 language model.
+
+
+If you're using MiniGPT-4/MiniGPT-v2 in your research or applications, please cite using this BibTeX:
+```bibtex
+
+
+@article{chen2023minigptv2,
+ title={MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning},
+      author={Chen, Jun and Zhu, Deyao and Shen, Xiaoqian and Li, Xiang and Liu, Zechun and Zhang, Pengchuan and Krishnamoorthi, Raghuraman and Chandra, Vikas and Xiong, Yunyang and Elhoseiny, Mohamed},
+ year={2023},
+ journal={arXiv preprint arXiv:2310.09478},
+}
+
+@article{zhu2023minigpt,
+ title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
+ author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
+ journal={arXiv preprint arXiv:2304.10592},
+ year={2023}
+}
+```
+
+
+## License
+This repository is under [BSD 3-Clause License](LICENSE.md).
+Much of the code is based on [Lavis](https://github.com/salesforce/LAVIS) with
+BSD 3-Clause License [here](LICENSE_Lavis.md).
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 0000000000000000000000000000000000000000..034e848032092eaf8ef96eac731b6ed5961987f3
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,21 @@
+# Security Policy
+
+## Supported Versions
+
+Use this section to tell people about which versions of your project are
+currently being supported with security updates.
+
+| Version | Supported |
+| ------- | ------------------ |
+| 5.1.x | :white_check_mark: |
+| 5.0.x | :x: |
+| 4.0.x | :white_check_mark: |
+| < 4.0 | :x: |
+
+## Reporting a Vulnerability
+
+Use this section to tell people how to report a vulnerability.
+
+Tell them where to go, how often they can expect to get an update on a
+reported vulnerability, what to expect if the vulnerability is accepted or
+declined, etc.
diff --git a/checkpoints/prerained_minigpt4_7b.pth b/checkpoints/prerained_minigpt4_7b.pth
new file mode 100644
index 0000000000000000000000000000000000000000..6ab50c857882ea19af36c57df4372afd5642888f
--- /dev/null
+++ b/checkpoints/prerained_minigpt4_7b.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:017a9ed588a11ed383711003cf50cf675191420a04689f682fb56fa9bbb8dcbb
+size 37907201
diff --git a/checkpoints/pretrained_minigpt4.pth b/checkpoints/pretrained_minigpt4.pth
new file mode 100644
index 0000000000000000000000000000000000000000..673ebd3025d30f920da5be697c6da1a26735775a
--- /dev/null
+++ b/checkpoints/pretrained_minigpt4.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1e7b8a3c21f146654c21a1a29a577dab2c3bd1aa3b1bc902f39e86954357a811
+size 47369169
diff --git a/checkpoints/pretrained_minigpt4_llama2_7b.pth b/checkpoints/pretrained_minigpt4_llama2_7b.pth
new file mode 100644
index 0000000000000000000000000000000000000000..c1e43ea8d05bae691938c0bd44ba9e550e97dc9a
--- /dev/null
+++ b/checkpoints/pretrained_minigpt4_llama2_7b.pth
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e2904bdc6ff4c95e5903d3c08cc0ad5b3f860be9cdbbd6fd05d99cd35b375096
+size 276957567
diff --git a/dataset/README_1_STAGE.md b/dataset/README_1_STAGE.md
new file mode 100644
index 0000000000000000000000000000000000000000..47ffaaef6ddf1677bb467e116e03b039febda759
--- /dev/null
+++ b/dataset/README_1_STAGE.md
@@ -0,0 +1,96 @@
+## Download the filtered Conceptual Captions, SBU, LAION datasets
+
+### Pre-training datasets download:
+We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).
+
+It requires about 2.3 TB of storage for the LAION and CC3M+CC12M+SBU datasets.
+
+Image source | Filtered synthetic caption by ViT-L
+--- | :---:
+CC3M+CC12M+SBU | Download
+LAION115M | Download
+
+This will download two JSON files:
+```
+ccs_synthetic_filtered_large.json
+laion_synthetic_filtered_large.json
+```
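+
+As a quick sanity check, you can load one of these files in Python and confirm it exposes the `url` and `caption` fields that the download scripts below pass to img2dataset (a minimal sketch):
+
+```python
+import json
+
+# peek at the filtered caption file downloaded above
+with open('ccs_synthetic_filtered_large.json', 'r') as f:
+    data = json.load(f)
+
+print(len(data), 'records')
+print(data[0])  # each record is expected to carry 'url' and 'caption',
+                # matching --url_col / --caption_col in download_cc_sbu.sh
+```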
+
+## Prepare the data step-by-step
+
+
+### Set up the dataset folder and move the annotation files to the data storage folder
+```
+export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
+mkdir ${MINIGPT4_DATASET}/cc_sbu
+mkdir ${MINIGPT4_DATASET}/laion
+mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
+mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
+```
+
+### Copy the conversion scripts to the data storage folder
+```
+cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
+cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
+cp convert_laion.py ${MINIGPT4_DATASET}/laion
+cp download_laion.sh ${MINIGPT4_DATASET}/laion
+```
+
+
+### Convert the laion and cc_sbu annotation files to the img2dataset format
+```
+cd ${MINIGPT4_DATASET}/cc_sbu
+python convert_cc_sbu.py
+
+cd ${MINIGPT4_DATASET}/laion
+python convert_laion.py
+```
+
+### Download the datasets with img2dataset
+```
+cd ${MINIGPT4_DATASET}/cc_sbu
+sh download_cc_sbu.sh
+cd ${MINIGPT4_DATASET}/laion
+sh download_laion.sh
+```
+
+
+The final dataset structure
+
+```
+.
+├── ${MINIGPT4_DATASET}
+│   ├── cc_sbu
+│   │   ├── convert_cc_sbu.py
+│   │   ├── download_cc_sbu.sh
+│   │   ├── ccs_synthetic_filtered_large.json
+│   │   ├── ccs_synthetic_filtered_large.tsv
+│   │   └── cc_sbu_dataset
+│   │       ├── 00000.tar
+│   │       ├── 00000.parquet
+│   │       ...
+│   └── laion
+│       ├── convert_laion.py
+│       ├── download_laion.sh
+│       ├── laion_synthetic_filtered_large.json
+│       ├── laion_synthetic_filtered_large.tsv
+│       └── laion_dataset
+│           ├── 00000.tar
+│           ├── 00000.parquet
+│           ...
+...
+```
+
+
+## Set up the dataset configuration files
+
+Then, set up the LAION dataset loading path in
+[here](../minigpt4/configs/datasets/laion/defaults.yaml#L5) at Line 5 as
+${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar
+
+and the Conceptual Captions and SBU datasets loading path in
+[here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) at Line 5 as
+${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar
+
+
+
diff --git a/dataset/README_2_STAGE.md b/dataset/README_2_STAGE.md
new file mode 100644
index 0000000000000000000000000000000000000000..b826765fef6675ab5f2e3a5dfe619d2e351614d3
--- /dev/null
+++ b/dataset/README_2_STAGE.md
@@ -0,0 +1,19 @@
+## Second Stage Data Preparation
+
+Our second stage dataset can be downloaded from
+[here](https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view?usp=share_link).
+After extraction, you will get a data folder with the following structure:
+
+```
+cc_sbu_align
+├── filter_cap.json
+└── image
+ ├── 2.jpg
+ ├── 3.jpg
+ ...
+```
+
+Put the folder at any path you like.
+Then, set up the dataset path in the dataset config file
+[here](../minigpt4/configs/datasets/cc_sbu/align.yaml#L5) at Line 5.
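+
+If you want to quickly check the extracted data, here is a minimal sketch (it only assumes the folder layout shown above):
+
+```python
+import json
+import os
+
+root = 'cc_sbu_align'  # adjust to wherever you extracted the archive
+
+# load the caption annotations and count the extracted images
+with open(os.path.join(root, 'filter_cap.json'), 'r') as f:
+    caps = json.load(f)
+
+print(type(caps))  # inspect the top-level structure of the caption file
+print(len(os.listdir(os.path.join(root, 'image'))), 'images found')
+```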
+
diff --git a/dataset/README_MINIGPTv2_FINETUNE.md b/dataset/README_MINIGPTv2_FINETUNE.md
new file mode 100644
index 0000000000000000000000000000000000000000..b29138d3c18ab3f6432daceea4ff17d3bd1e3fd7
--- /dev/null
+++ b/dataset/README_MINIGPTv2_FINETUNE.md
@@ -0,0 +1,285 @@
+## Download the datasets for finetuning MiniGPT-v2
+
+
+Download the following datasets:
+
+Image source | Download path
+--- | :---:
+COCO 2014 images | images captions
+COCO VQA | vqa train vqa val
+Visual Genome | images part1 images part2 image meta data
+TextCaps | images annotations
+RefCOCO | annotations
+RefCOCO+ | annotations
+RefCOCOg | annotations
+OKVQA | annotations
+AOK-VQA | annotations
+OCR-VQA | annotations
+GQA | images annotations
+Filtered flickr-30k | annotations
+Multi-task conversation | annotations
+Filtered unnatural instruction | annotations
+LLaVA | Complex reasoning Detailed description Conversation
+
+
+
+### COCO captions
+Download the COCO 2014 images and captions
+
+coco 2014 images path
+
+```
+${MINIGPTv2_DATASET}
+├── coco
+│ ├── images
+...
+```
+
+
+coco caption annotation path
+
+```
+${MINIGPTv2_DATASET}
+├── coco_captions
+│ └── annotations
+│ ├── coco_karpathy_train.json
+...
+```
+
+Set **image_path** to the COCO 2014 image folder.
+Similarly, set **ann_path** to the coco_karpathy_train.json path
+- [minigpt4/configs/datasets/coco/caption.yaml](../minigpt4/configs/datasets/coco/caption.yaml)
+
+### COCO VQA
+Download the vqa v2 train and validation json files
+
+```
+├── ${MINIGPTv2_DATASET}
+│ ├── vqav2
+│ ├── vqa_train.json
+| ├── vqa_val.json
+```
+
+Set **image_path** to the COCO 2014 image folder.
+Similarly, set **ann_path** to the vqa_train.json and vqa_val.json path
+- [minigpt4/configs/datasets/coco/defaults_vqa.yaml](../minigpt4/configs/datasets/coco/defaults_vqa.yaml)
+
+
+### Visual Genome
+Download the Visual Genome images and annotation files
+
+```
+${MINIGPTv2_DATASET}
+├── visual_genome
+│ ├── VG_100K
+│ ├── VG_100K_2
+│ └── region_descriptions.json
+│ └── image_data.json
+...
+```
+
+Set **image_path** to visual_genome folder.
+Similarly, set **ann_path** to the visual_genome folder.
+
+- [minigpt4/configs/datasets/vg/ref.yaml](../minigpt4/configs/datasets/vg/ref.yaml)
+
+
+### TextCaps
+Download the TextCaps images and annotation files
+
+```
+├── ${MINIGPTv2_DATASET}
+│ ├── textcaps
+│ ├── train_images
+│ ├── TextCaps_0.1_train.json
+```
+
+Set **image_path** to TextCaps train_images folder.
+Similarly, set **ann_path** to the TextCaps_0.1_train.json path
+
+- [minigpt4/configs/datasets/textcaps/caption.yaml](../minigpt4/configs/datasets/textcaps/caption.yaml)
+
+### RefCOCO, RefCOCO+, RefCOCOg
+Download the RefCOCO, RefCOCO+, RefCOCOg annotation files
+
+```
+
+${MINIGPTv2_DATASET}
+├── refcoco_annotations
+│ ├── refcoco
+│ │ ├── instances.json
+│ │ ├── refs(google).p
+│ │ └── refs(unc).p
+│ ├── refcoco+
+│ │ ├── instances.json
+│ │ └── refs(unc).p
+│ └── refcocog
+│ │ ├── instances.json
+│ │ ├── refs(google).p
+│ │ └── refs(umd).p
+...
+```
+
+
+Set **image_path** to the COCO 2014 image folder.
+Similarly, set **ann_path** in all the following configs to the above folder *refcoco_annotations* that contains refcoco, refcoco+, and refcocog.
+
+- [minigpt4/configs/datasets/coco_bbox/refcoco.yaml](../minigpt4/configs/datasets/coco_bbox/refcoco.yaml)
+- [minigpt4/configs/datasets/coco_bbox/refcocog.yaml](../minigpt4/configs/datasets/coco_bbox/refcocog.yaml)
+- [minigpt4/configs/datasets/coco_bbox/refcocop.yaml](../minigpt4/configs/datasets/coco_bbox/refcocop.yaml)
+- [minigpt4/configs/datasets/coco_bbox/invrefcoco.yaml](../minigpt4/configs/datasets/coco_bbox/invrefcoco.yaml)
+- [minigpt4/configs/datasets/coco_bbox/invrefcocog.yaml](../minigpt4/configs/datasets/coco_bbox/invrefcocog.yaml)
+- [minigpt4/configs/datasets/coco_bbox/invrefcocop.yaml](../minigpt4/configs/datasets/coco_bbox/invrefcocop.yaml)
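+
+If you want to verify the downloaded annotations before training, here is a minimal sketch (paths assume the layout above; the `refs(*).p` files are pickled annotation lists and `instances.json` is COCO-style JSON):
+
+```python
+import json
+import pickle
+
+# inspect the RefCOCO referring expressions and the COCO-style instance file
+with open('refcoco_annotations/refcoco/refs(unc).p', 'rb') as f:
+    refs = pickle.load(f)
+with open('refcoco_annotations/refcoco/instances.json', 'r') as f:
+    instances = json.load(f)
+
+print(len(refs), 'referring expressions')
+print(list(instances.keys()))
+```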
+
+
+
+
+### OKVQA
+
+
+```
+Location_you_like
+├── ${MINIGPTv2_DATASET}
+│ ├── okvqa
+│ ├── okvqa_train.json
+```
+
+Set **image_path** to the COCO 2014 image folder.
+Similarly, set **ann_path** to the location of the OKVQA dataset
+- [minigpt4/configs/datasets/okvqa/defaults.yaml](../minigpt4/configs/datasets/okvqa/defaults.yaml)
+
+
+### OK-VQA download links
+
+- [OK-VQA Input Questions](https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip)
+- [OK-VQA Annotations](https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip)
+
+
+### AOK-VQA
+Download the AOK-VQA annotation dataset
+
+```
+export AOKVQA_DIR=YOUR_DATASET_PATH
+mkdir -p ${AOKVQA_DIR}
+curl -fsSL https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0.tar.gz | tar xvz -C ${AOKVQA_DIR}
+```
+
+```
+Location_you_like
+├── ${MINIGPTv2_DATASET}
+│ ├── aokvqa
+│ ├── aokvqa_v1p0_train.json
+```
+
+
+Set **image_path** to the COCO 2014 image folder.
+Similarly, set **ann_path** to the location of the AOKVQA dataset
+- [minigpt4/configs/datasets/aokvqa/defaults.yaml](../minigpt4/configs/datasets/aokvqa/defaults.yaml)
+
+
+
+### OCR-VQA
+Download the OCR-VQA annotation files and
+download the images with the loadDataset.py script.
+
+```
+Location_you_like
+├── ${MINIGPTv2_DATASET}
+│ ├── ocrvqa
+│ ├── images
+│ ├── dataset.json
+```
+
+Set **image_path** as the ocrvqa/images folder.
+Similarly, set **ann_path** to the dataset.json
+- [minigpt4/configs/datasets/ocrvqa/ocrvqa.yaml](../minigpt4/configs/datasets/ocrvqa/ocrvqa.yaml)
+
+### GQA
+Download the GQA annotation files and images
+
+```
+Location_you_like
+├── ${MINIGPTv2_DATASET}
+│ ├── gqa
+│ ├── images
+│ ├── train_balanced_questions.json
+```
+
+Set **image_path** as the gqa/images folder.
+Similarly, set **ann_path** to the train_balanced_questions.json
+- [minigpt4/configs/datasets/gqa/balanced_val.yaml](../minigpt4/configs/datasets/gqa/balanced_val.yaml)
+
+
+
+### Filtered Flickr-30k
+Download the filtered Flickr-30k images (fill this [form](https://forms.illinois.edu/sec/229675) on the official website or download from [Kaggle](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset/download?datasetVersionNumber=1)) and annotation files.
+
+```
+${MINIGPTv2_DATASET}
+├── filtered_flickr
+│ ├── images
+│ ├── captiontobbox.json
+│ ├── groundedcaption.json
+│ └── phrasetobbox.json
+...
+```
+
+Set **image_path** as the flickr-30k images folder.
+Similarly, set **ann_path** to the groundedcaption.json, captiontobbox.json and phrasetobbox.json for the
+grounded image caption, caption to bbox, and phrase to bbox datasets.
+
+- [minigpt4/configs/datasets/flickr/default.yaml](../minigpt4/configs/datasets/flickr/default.yaml)
+- [minigpt4/configs/datasets/flickr/caption_to_phrase.yaml](../minigpt4/configs/datasets/flickr/caption_to_phrase.yaml)
+- [minigpt4/configs/datasets/flickr/object_to_phrase.yaml](../minigpt4/configs/datasets/flickr/object_to_phrase.yaml)
+
+
+### Multi-task conversation
+Download the multi-task conversation dataset
+
+```
+Location_you_like
+${MINIGPTv2_DATASET}
+├── multitask_conversation
+│ └── multitask_conversation.json
+...
+```
+
+Set **image_path** as the COCO 2014 images folder.
+Similarly, set **ann_path** to the multitask_conversation.json file path
+
+- [minigpt4/configs/datasets/multitask_conversation/default.yaml](../minigpt4/configs/datasets/multitask_conversation/default.yaml)
+
+### Unnatural instruction
+Download the filtered unnatural instruction annotation files (we removed the very long sentences from the original Unnatural Instructions dataset).
+
+```
+Location_you_like
+├── ${MINIGPTv2_DATASET}
+│ ├── unnatural_instructions
+│ ├── filtered_unnatural_instruction.json
+```
+
+This dataset has no image path.
+Set **ann_path** to the filtered_unnatural_instruction.json file path.
+
+- [minigpt4/configs/datasets/nlp/unnatural_instruction.yaml](../minigpt4/configs/datasets/nlp/unnatural_instruction.yaml)
+
+### LLaVA
+
+```
+Location_you_like
+├── ${MINIGPTv2_DATASET}
+│ ├── llava
+│ ├── conversation_58k.json
+│ ├── detail_23k.json
+│ ├── complex_reasoning_77k.json
+```
+
+Set **image_path** to the COCO 2014 image folder.
+Similarly, set **ann_path** to the location of the previously downloaded conversation_58k.json,
+detail_23k.json, and complex_reasoning_77k.json in conversation.yaml, detail.yaml, and reason.yaml, respectively.
+
+
+- [minigpt4/configs/datasets/llava/conversation.yaml](../minigpt4/configs/datasets/llava/conversation.yaml)
+- [minigpt4/configs/datasets/llava/detail.yaml](../minigpt4/configs/datasets/llava/detail.yaml)
+- [minigpt4/configs/datasets/llava/reason.yaml](../minigpt4/configs/datasets/llava/reason.yaml)
diff --git a/dataset/convert_cc_sbu.py b/dataset/convert_cc_sbu.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c325ed3afa3ddb81c5535b5a6febc23d3d5ceee
--- /dev/null
+++ b/dataset/convert_cc_sbu.py
@@ -0,0 +1,20 @@
+import json
+import csv
+
+# specify input and output file paths
+input_file = 'ccs_synthetic_filtered_large.json'
+output_file = 'ccs_synthetic_filtered_large.tsv'
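+# note: the whole JSON file is loaded into memory; the resulting TSV's 'url' and
+# 'caption' columns are what download_cc_sbu.sh later passes to img2dataset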
+
+# load JSON data from input file
+with open(input_file, 'r') as f:
+ data = json.load(f)
+
+# extract header and data from JSON
+header = data[0].keys()
+rows = [x.values() for x in data]
+
+# write data to TSV file
+with open(output_file, 'w') as f:
+ writer = csv.writer(f, delimiter='\t')
+ writer.writerow(header)
+ writer.writerows(rows)
diff --git a/dataset/convert_laion.py b/dataset/convert_laion.py
new file mode 100644
index 0000000000000000000000000000000000000000..b793579ce276b72a4313bba4f237b8cb0becb294
--- /dev/null
+++ b/dataset/convert_laion.py
@@ -0,0 +1,20 @@
+import json
+import csv
+
+# specify input and output file paths
+input_file = 'laion_synthetic_filtered_large.json'
+output_file = 'laion_synthetic_filtered_large.tsv'
+
+# load JSON data from input file
+with open(input_file, 'r') as f:
+ data = json.load(f)
+
+# extract header and data from JSON
+header = data[0].keys()
+rows = [x.values() for x in data]
+
+# write data to TSV file
+with open(output_file, 'w') as f:
+ writer = csv.writer(f, delimiter='\t')
+ writer.writerow(header)
+ writer.writerows(rows)
diff --git a/dataset/download_cc_sbu.sh b/dataset/download_cc_sbu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..64082eee0466bdad0fb5d377f4501758a82e805c
--- /dev/null
+++ b/dataset/download_cc_sbu.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+img2dataset --url_list ccs_synthetic_filtered_large.tsv --input_format "tsv"\
+ --url_col "url" --caption_col "caption" --output_format webdataset\
+ --output_folder cc_sbu_dataset --processes_count 16 --thread_count 128 --image_size 224 \
+ --enable_wandb True
diff --git a/dataset/download_laion.sh b/dataset/download_laion.sh
new file mode 100644
index 0000000000000000000000000000000000000000..42beb0c9af3535ef55045a1e8a1333d623f540ad
--- /dev/null
+++ b/dataset/download_laion.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+img2dataset --url_list laion_synthetic_filtered_large.tsv --input_format "tsv"\
+ --url_col "url" --caption_col "caption" --output_format webdataset\
+ --output_folder laion_dataset --processes_count 16 --thread_count 128 --image_size 224 \
+ --enable_wandb True
diff --git a/demo.py b/demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7646c43b51d59a29d5d6fe872c34c27c14981e5
--- /dev/null
+++ b/demo.py
@@ -0,0 +1,171 @@
+import argparse
+import os
+import random
+
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+import gradio as gr
+
+from transformers import StoppingCriteriaList
+
+from minigpt4.common.config import Config
+from minigpt4.common.dist_utils import get_rank
+from minigpt4.common.registry import registry
+from minigpt4.conversation.conversation import Chat, CONV_VISION_Vicuna0, CONV_VISION_LLama2, StoppingCriteriaSub
+
+# imports modules for registration
+from minigpt4.datasets.builders import *
+from minigpt4.models import *
+from minigpt4.processors import *
+from minigpt4.runners import *
+from minigpt4.tasks import *
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="Demo")
+ parser.add_argument("--cfg-path", required=True, help="path to configuration file.")
+ parser.add_argument("--gpu-id", type=int, default=0, help="specify the gpu to load the model.")
+ parser.add_argument(
+ "--options",
+ nargs="+",
+ help="override some settings in the used config, the key-value pair "
+ "in xxx=yyy format will be merged into config file (deprecate), "
+ "change to --cfg-options instead.",
+ )
+ args = parser.parse_args()
+ return args
+
+
+def setup_seeds(config):
+ seed = config.run_cfg.seed + get_rank()
+
+ random.seed(seed)
+ np.random.seed(seed)
+ torch.manual_seed(seed)
+
+ cudnn.benchmark = False
+ cudnn.deterministic = True
+
+
+# ========================================
+# Model Initialization
+# ========================================
+
+conv_dict = {'pretrain_vicuna0': CONV_VISION_Vicuna0,
+ 'pretrain_llama2': CONV_VISION_LLama2}
+
+print('Initializing Chat')
+args = parse_args()
+cfg = Config(args)
+
+model_config = cfg.model_cfg
+model_config.device_8bit = args.gpu_id
+model_cls = registry.get_model_class(model_config.arch)
+model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
+
+CONV_VISION = conv_dict[model_config.model_type]
+
+vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
+vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
+
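+# token id sequences that end a model turn (intended to match the '###' separators of the conversation templates)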
+stop_words_ids = [[835], [2277, 29937]]
+stop_words_ids = [torch.tensor(ids).to(device='cuda:{}'.format(args.gpu_id)) for ids in stop_words_ids]
+stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
+
+chat = Chat(model, vis_processor, device='cuda:{}'.format(args.gpu_id), stopping_criteria=stopping_criteria)
+print('Initialization Finished')
+
+
+# ========================================
+# Gradio Setting
+# ========================================
+
+
+def gradio_reset(chat_state, img_list):
+ if chat_state is not None:
+ chat_state.messages = []
+ if img_list is not None:
+ img_list = []
+ return None, gr.update(value=None, interactive=True), gr.update(placeholder='Please upload your image first', interactive=False),gr.update(value="Upload & Start Chat", interactive=True), chat_state, img_list
+
+
+def upload_img(gr_img, text_input, chat_state):
+ if gr_img is None:
+ return None, None, gr.update(interactive=True), chat_state, None
+ chat_state = CONV_VISION.copy()
+ img_list = []
+ llm_message = chat.upload_img(gr_img, chat_state, img_list)
+ chat.encode_img(img_list)
+ return gr.update(interactive=False), gr.update(interactive=True, placeholder='Type and press Enter'), gr.update(value="Start Chatting", interactive=False), chat_state, img_list
+
+
+def gradio_ask(user_message, chatbot, chat_state):
+ if len(user_message) == 0:
+ return gr.update(interactive=True, placeholder='Input should not be empty!'), chatbot, chat_state
+ chat.ask(user_message, chat_state)
+ chatbot = chatbot + [[user_message, None]]
+ return '', chatbot, chat_state
+
+
+def gradio_answer(chatbot, chat_state, img_list, num_beams, temperature):
+ llm_message = chat.answer(conv=chat_state,
+ img_list=img_list,
+ num_beams=num_beams,
+ temperature=temperature,
+ max_new_tokens=300,
+ max_length=2000)[0]
+ chatbot[-1][1] = llm_message
+ return chatbot, chat_state, img_list
+
+
+title = """Demo of MiniGPT-4
"""
+description = """This is the demo of MiniGPT-4. Upload your images and start chatting!
"""
+article = """


+"""
+
+#TODO show examples below
+
+with gr.Blocks() as demo:
+ gr.Markdown(title)
+ gr.Markdown(description)
+ gr.Markdown(article)
+
+ with gr.Row():
+ with gr.Column(scale=1):
+ image = gr.Image(type="pil")
+ upload_button = gr.Button(value="Upload & Start Chat", interactive=True, variant="primary")
+ clear = gr.Button("Restart")
+
+ num_beams = gr.Slider(
+ minimum=1,
+ maximum=10,
+ value=1,
+ step=1,
+ interactive=True,
+                label="beam search numbers",
+ )
+
+ temperature = gr.Slider(
+ minimum=0.1,
+ maximum=2.0,
+ value=1.0,
+ step=0.1,
+ interactive=True,
+ label="Temperature",
+ )
+
+ with gr.Column(scale=2):
+ chat_state = gr.State()
+ img_list = gr.State()
+ chatbot = gr.Chatbot(label='MiniGPT-4')
+ text_input = gr.Textbox(label='User', placeholder='Please upload your image first', interactive=False)
+
+ upload_button.click(upload_img, [image, text_input, chat_state], [image, text_input, upload_button, chat_state, img_list])
+
+ text_input.submit(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
+ gradio_answer, [chatbot, chat_state, img_list, num_beams, temperature], [chatbot, chat_state, img_list]
+ )
+ clear.click(gradio_reset, [chat_state, img_list], [chatbot, image, text_input, upload_button, chat_state, img_list], queue=False)
+
+demo.launch(share=True, enable_queue=True)
diff --git a/demo_v2.py b/demo_v2.py
new file mode 100644
index 0000000000000000000000000000000000000000..5e2deeb430e971820ad22f71efa90fdde523d40a
--- /dev/null
+++ b/demo_v2.py
@@ -0,0 +1,647 @@
+import argparse
+import os
+import random
+from collections import defaultdict
+
+import cv2
+import re
+
+import numpy as np
+from PIL import Image
+import torch
+import html
+import gradio as gr
+
+import torchvision.transforms as T
+import torch.backends.cudnn as cudnn
+
+from minigpt4.common.config import Config
+
+from minigpt4.common.registry import registry
+from minigpt4.conversation.conversation import Conversation, SeparatorStyle, Chat
+
+# imports modules for registration
+from minigpt4.datasets.builders import *
+from minigpt4.models import *
+from minigpt4.processors import *
+from minigpt4.runners import *
+from minigpt4.tasks import *
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="Demo")
+ parser.add_argument("--cfg-path", default='eval_configs/minigptv2_eval.yaml',
+ help="path to configuration file.")
+ parser.add_argument("--gpu-id", type=int, default=0, help="specify the gpu to load the model.")
+ parser.add_argument(
+ "--options",
+ nargs="+",
+ help="override some settings in the used config, the key-value pair "
+ "in xxx=yyy format will be merged into config file (deprecate), "
+ "change to --cfg-options instead.",
+ )
+ args = parser.parse_args()
+ return args
+
+
+random.seed(42)
+np.random.seed(42)
+torch.manual_seed(42)
+
+cudnn.benchmark = False
+cudnn.deterministic = True
+
+print('Initializing Chat')
+args = parse_args()
+cfg = Config(args)
+
+device = 'cuda:{}'.format(args.gpu_id)
+
+model_config = cfg.model_cfg
+model_config.device_8bit = args.gpu_id
+model_cls = registry.get_model_class(model_config.arch)
+model = model_cls.from_config(model_config).to(device)
+bounding_box_size = 100
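+# the model expresses bounding boxes on a 100x100 grid; they are rescaled to the displayed image size below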
+
+vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
+vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
+
+model = model.eval()
+
+CONV_VISION = Conversation(
+ system="",
+ roles=(r"[INST] ", r" [/INST]"),
+ messages=[],
+ offset=2,
+ sep_style=SeparatorStyle.SINGLE,
+ sep="",
+)
+
+
+def extract_substrings(string):
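+    # pull the grounded spans ("<p>phrase</p>{<x1><y1><x2><y2>}...") out of the model output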
+ # first check if there is no-finished bracket
+ index = string.rfind('}')
+ if index != -1:
+ string = string[:index + 1]
+
+    pattern = r'<p>(.*?)\}(?!<)'
+ matches = re.findall(pattern, string)
+ substrings = [match for match in matches]
+
+ return substrings
+
+
+def is_overlapping(rect1, rect2):
+ x1, y1, x2, y2 = rect1
+ x3, y3, x4, y4 = rect2
+ return not (x2 < x3 or x1 > x4 or y2 < y3 or y1 > y4)
+
+
+def computeIoU(bbox1, bbox2):
+ x1, y1, x2, y2 = bbox1
+ x3, y3, x4, y4 = bbox2
+ intersection_x1 = max(x1, x3)
+ intersection_y1 = max(y1, y3)
+ intersection_x2 = min(x2, x4)
+ intersection_y2 = min(y2, y4)
+ intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max(0, intersection_y2 - intersection_y1 + 1)
+ bbox1_area = (x2 - x1 + 1) * (y2 - y1 + 1)
+ bbox2_area = (x4 - x3 + 1) * (y4 - y3 + 1)
+ union_area = bbox1_area + bbox2_area - intersection_area
+ iou = intersection_area / union_area
+ return iou
+
+
+def save_tmp_img(visual_img):
+ file_name = "".join([str(random.randint(0, 9)) for _ in range(5)]) + ".jpg"
+ file_path = "/tmp/gradio" + file_name
+ visual_img.save(file_path)
+ return file_path
+
+
+def mask2bbox(mask):
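+    # convert a user-drawn mask into the "{<x1><y1><x2><y2>}" bbox string (on a 100x100 grid) used by the [identify] task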
+ if mask is None:
+ return ''
+ mask = mask.resize([100, 100], resample=Image.NEAREST)
+ mask = np.array(mask)[:, :, 0]
+
+ rows = np.any(mask, axis=1)
+ cols = np.any(mask, axis=0)
+
+ if rows.sum():
+ # Get the top, bottom, left, and right boundaries
+ rmin, rmax = np.where(rows)[0][[0, -1]]
+ cmin, cmax = np.where(cols)[0][[0, -1]]
+ bbox = '{{<{}><{}><{}><{}>}}'.format(cmin, rmin, cmax, rmax)
+ else:
+ bbox = ''
+
+ return bbox
+
+
+def escape_markdown(text):
+ # List of Markdown special characters that need to be escaped
+ md_chars = ['<', '>']
+
+ # Escape each special character
+ for char in md_chars:
+ text = text.replace(char, '\\' + char)
+
+ return text
+
+
+def reverse_escape(text):
+ md_chars = ['\\<', '\\>']
+
+ for char in md_chars:
+ text = text.replace(char, char[1:])
+
+ return text
+
+
+colors = [
+ (255, 0, 0),
+ (0, 255, 0),
+ (0, 0, 255),
+ (210, 210, 0),
+ (255, 0, 255),
+ (0, 255, 255),
+ (114, 128, 250),
+ (0, 165, 255),
+ (0, 128, 0),
+ (144, 238, 144),
+ (238, 238, 175),
+ (255, 191, 0),
+ (0, 128, 0),
+ (226, 43, 138),
+ (255, 0, 255),
+ (0, 215, 255),
+]
+
+color_map = {
+ f"{color_id}": f"#{hex(color[2])[2:].zfill(2)}{hex(color[1])[2:].zfill(2)}{hex(color[0])[2:].zfill(2)}" for
+ color_id, color in enumerate(colors)
+}
+
+used_colors = colors
+
+
+def visualize_all_bbox_together(image, generation):
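+    # parse the bounding boxes in the generated text and draw them, with their phrase labels, on the image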
+ if image is None:
+ return None, ''
+
+ generation = html.unescape(generation)
+
+ image_width, image_height = image.size
+ image = image.resize([500, int(500 / image_width * image_height)])
+ image_width, image_height = image.size
+
+ string_list = extract_substrings(generation)
+ if string_list: # it is grounding or detection
+ mode = 'all'
+ entities = defaultdict(list)
+ i = 0
+ j = 0
+ for string in string_list:
+ try:
+                obj, string = string.split('</p>')
+ except ValueError:
+ print('wrong string: ', string)
+ continue
+            bbox_list = string.split('<delim>')
+ flag = False
+ for bbox_string in bbox_list:
+ integers = re.findall(r'-?\d+', bbox_string)
+ if len(integers) == 4:
+ x0, y0, x1, y1 = int(integers[0]), int(integers[1]), int(integers[2]), int(integers[3])
+ left = x0 / bounding_box_size * image_width
+ bottom = y0 / bounding_box_size * image_height
+ right = x1 / bounding_box_size * image_width
+ top = y1 / bounding_box_size * image_height
+
+ entities[obj].append([left, bottom, right, top])
+
+ j += 1
+ flag = True
+ if flag:
+ i += 1
+ else:
+ integers = re.findall(r'-?\d+', generation)
+
+ if len(integers) == 4: # it is refer
+ mode = 'single'
+
+ entities = list()
+ x0, y0, x1, y1 = int(integers[0]), int(integers[1]), int(integers[2]), int(integers[3])
+ left = x0 / bounding_box_size * image_width
+ bottom = y0 / bounding_box_size * image_height
+ right = x1 / bounding_box_size * image_width
+ top = y1 / bounding_box_size * image_height
+ entities.append([left, bottom, right, top])
+ else:
+ # don't detect any valid bbox to visualize
+ return None, ''
+
+ if len(entities) == 0:
+ return None, ''
+
+ if isinstance(image, Image.Image):
+ image_h = image.height
+ image_w = image.width
+ image = np.array(image)
+
+ elif isinstance(image, str):
+ if os.path.exists(image):
+ pil_img = Image.open(image).convert("RGB")
+ image = np.array(pil_img)[:, :, [2, 1, 0]]
+ image_h = pil_img.height
+ image_w = pil_img.width
+ else:
+            raise ValueError(f"invalid image path, {image}")
+ elif isinstance(image, torch.Tensor):
+
+ image_tensor = image.cpu()
+ reverse_norm_mean = torch.tensor([0.48145466, 0.4578275, 0.40821073])[:, None, None]
+ reverse_norm_std = torch.tensor([0.26862954, 0.26130258, 0.27577711])[:, None, None]
+ image_tensor = image_tensor * reverse_norm_std + reverse_norm_mean
+ pil_img = T.ToPILImage()(image_tensor)
+ image_h = pil_img.height
+ image_w = pil_img.width
+ image = np.array(pil_img)[:, :, [2, 1, 0]]
+ else:
+        raise ValueError(f"invalid image format, {type(image)} for {image}")
+
+ indices = list(range(len(entities)))
+
+ new_image = image.copy()
+
+ previous_bboxes = []
+ # size of text
+ text_size = 0.5
+ # thickness of text
+ text_line = 1 # int(max(1 * min(image_h, image_w) / 512, 1))
+ box_line = 2
+ (c_width, text_height), _ = cv2.getTextSize("F", cv2.FONT_HERSHEY_COMPLEX, text_size, text_line)
+ base_height = int(text_height * 0.675)
+ text_offset_original = text_height - base_height
+ text_spaces = 2
+
+ # num_bboxes = sum(len(x[-1]) for x in entities)
+ used_colors = colors # random.sample(colors, k=num_bboxes)
+
+ color_id = -1
+ for entity_idx, entity_name in enumerate(entities):
+ if mode == 'single' or mode == 'identify':
+ bboxes = entity_name
+ bboxes = [bboxes]
+ else:
+ bboxes = entities[entity_name]
+ color_id += 1
+ for bbox_id, (x1_norm, y1_norm, x2_norm, y2_norm) in enumerate(bboxes):
+ skip_flag = False
+ orig_x1, orig_y1, orig_x2, orig_y2 = int(x1_norm), int(y1_norm), int(x2_norm), int(y2_norm)
+
+ color = used_colors[entity_idx % len(used_colors)] # tuple(np.random.randint(0, 255, size=3).tolist())
+ new_image = cv2.rectangle(new_image, (orig_x1, orig_y1), (orig_x2, orig_y2), color, box_line)
+
+ if mode == 'all':
+ l_o, r_o = box_line // 2 + box_line % 2, box_line // 2 + box_line % 2 + 1
+
+ x1 = orig_x1 - l_o
+ y1 = orig_y1 - l_o
+
+ if y1 < text_height + text_offset_original + 2 * text_spaces:
+ y1 = orig_y1 + r_o + text_height + text_offset_original + 2 * text_spaces
+ x1 = orig_x1 + r_o
+
+ # add text background
+ (text_width, text_height), _ = cv2.getTextSize(f" {entity_name}", cv2.FONT_HERSHEY_COMPLEX, text_size,
+ text_line)
+ text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2 = x1, y1 - (
+ text_height + text_offset_original + 2 * text_spaces), x1 + text_width, y1
+
+ for prev_bbox in previous_bboxes:
+ if computeIoU((text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2), prev_bbox['bbox']) > 0.95 and \
+ prev_bbox['phrase'] == entity_name:
+ skip_flag = True
+ break
+ while is_overlapping((text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2), prev_bbox['bbox']):
+ text_bg_y1 += (text_height + text_offset_original + 2 * text_spaces)
+ text_bg_y2 += (text_height + text_offset_original + 2 * text_spaces)
+ y1 += (text_height + text_offset_original + 2 * text_spaces)
+
+ if text_bg_y2 >= image_h:
+ text_bg_y1 = max(0, image_h - (text_height + text_offset_original + 2 * text_spaces))
+ text_bg_y2 = image_h
+ y1 = image_h
+ break
+ if not skip_flag:
+ alpha = 0.5
+ for i in range(text_bg_y1, text_bg_y2):
+ for j in range(text_bg_x1, text_bg_x2):
+ if i < image_h and j < image_w:
+ if j < text_bg_x1 + 1.35 * c_width:
+ # original color
+ bg_color = color
+ else:
+ # white
+ bg_color = [255, 255, 255]
+ new_image[i, j] = (alpha * new_image[i, j] + (1 - alpha) * np.array(bg_color)).astype(
+ np.uint8)
+
+ cv2.putText(
+ new_image, f" {entity_name}", (x1, y1 - text_offset_original - 1 * text_spaces),
+ cv2.FONT_HERSHEY_COMPLEX, text_size, (0, 0, 0), text_line, cv2.LINE_AA
+ )
+
+ previous_bboxes.append(
+ {'bbox': (text_bg_x1, text_bg_y1, text_bg_x2, text_bg_y2), 'phrase': entity_name})
+
+ if mode == 'all':
+ def color_iterator(colors):
+ while True:
+ for color in colors:
+ yield color
+
+ color_gen = color_iterator(colors)
+
+        # Add colors to phrases and remove the <p></p> markers
+ def colored_phrases(match):
+ phrase = match.group(1)
+ color = next(color_gen)
+            return f'<span style="color:rgb{color}">{phrase}</span>'
+
+        generation = re.sub(r'{<\d+><\d+><\d+><\d+>}|<delim>', '', generation)
+        generation_colored = re.sub(r'<p>(.*?)</p>', colored_phrases, generation)
+ else:
+ generation_colored = ''
+
+ pil_image = Image.fromarray(new_image)
+ return pil_image, generation_colored
+
+
+def gradio_reset(chat_state, img_list):
+ if chat_state is not None:
+ chat_state.messages = []
+ if img_list is not None:
+ img_list = []
+ return None, gr.update(value=None, interactive=True), gr.update(placeholder='Upload your image and chat',
+ interactive=True), chat_state, img_list
+
+
+def image_upload_trigger(upload_flag, replace_flag, img_list):
+ # set the upload flag to true when receive a new image.
+ # if there is an old image (and old conversation), set the replace flag to true to reset the conv later.
+ upload_flag = 1
+ if img_list:
+ replace_flag = 1
+ return upload_flag, replace_flag
+
+
+def example_trigger(text_input, image, upload_flag, replace_flag, img_list):
+ # set the upload flag to true when receive a new image.
+ # if there is an old image (and old conversation), set the replace flag to true to reset the conv later.
+ upload_flag = 1
+ if img_list or replace_flag == 1:
+ replace_flag = 1
+
+ return upload_flag, replace_flag
+
+
+def gradio_ask(user_message, chatbot, chat_state, gr_img, img_list, upload_flag, replace_flag):
+ if len(user_message) == 0:
+ text_box_show = 'Input should not be empty!'
+ else:
+ text_box_show = ''
+
+ if isinstance(gr_img, dict):
+ gr_img, mask = gr_img['image'], gr_img['mask']
+ else:
+ mask = None
+
+ if '[identify]' in user_message:
+ # check if user provide bbox in the text input
+ integers = re.findall(r'-?\d+', user_message)
+ if len(integers) != 4: # no bbox in text
+ bbox = mask2bbox(mask)
+ user_message = user_message + bbox
+
+ if chat_state is None:
+ chat_state = CONV_VISION.copy()
+
+ if upload_flag:
+ if replace_flag:
+ chat_state = CONV_VISION.copy() # new image, reset everything
+ replace_flag = 0
+ chatbot = []
+ img_list = []
+ llm_message = chat.upload_img(gr_img, chat_state, img_list)
+ upload_flag = 0
+
+ chat.ask(user_message, chat_state)
+
+ chatbot = chatbot + [[user_message, None]]
+
+ if '[identify]' in user_message:
+ visual_img, _ = visualize_all_bbox_together(gr_img, user_message)
+ if visual_img is not None:
+ file_path = save_tmp_img(visual_img)
+ chatbot = chatbot + [[(file_path,), None]]
+
+ return text_box_show, chatbot, chat_state, img_list, upload_flag, replace_flag
+
+
+def gradio_answer(chatbot, chat_state, img_list, temperature):
+ llm_message = chat.answer(conv=chat_state,
+ img_list=img_list,
+ temperature=temperature,
+ max_new_tokens=500,
+ max_length=2000)[0]
+ chatbot[-1][1] = llm_message
+ return chatbot, chat_state
+
+
+def gradio_stream_answer(chatbot, chat_state, img_list, temperature):
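+    # encode the image if needed, then stream the answer token by token into the chatbot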
+ if len(img_list) > 0:
+ if not isinstance(img_list[0], torch.Tensor):
+ chat.encode_img(img_list)
+ streamer = chat.stream_answer(conv=chat_state,
+ img_list=img_list,
+ temperature=temperature,
+ max_new_tokens=500,
+ max_length=2000)
+ output = ''
+ for new_output in streamer:
+ escapped = escape_markdown(new_output)
+ output += escapped
+ chatbot[-1][1] = output
+ yield chatbot, chat_state
+ chat_state.messages[-1][1] = '