File size: 16,836 Bytes
779abe8 cd5b78b 779abe8 cd5b78b 779abe8 6bf91d5 d3aab5b 6bf91d5 779abe8 9b00181 779abe8 0c3eccf 779abe8 d0e7aec 779abe8 3160bb0 f09067b 3160bb0 779abe8 3160bb0 779abe8 cd5b78b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 |
---
language: en
tags:
- multimodal
- text
- image
- image-to-text
datasets:
- HuggingFaceM4/OBELICS
- laion/laion2B-en
- coyo-700m
- mmc4
pipeline_tag: text-generation
inference: true
---
<br>
<p align="center">
<img src="assets/infimm-logo.webp" alt="InfiMM-logo" width="400"></a>
</p>
<br>
# InfiMM
InfiMM, inspired by the Flamingo architecture, sets itself apart with unique training data and diverse large language models (LLMs). This approach allows InfiMM to maintain the core strengths of Flamingo while offering enhanced capabilities. As the premier open-sourced variant in this domain, InfiMM excels in accessibility and adaptability, driven by community collaboration. It's more than an emulation of Flamingo; it's an innovation in visual language processing.
Our model is another attempt to produce the result reported in the paper "Flamingo: A Large-scale Visual Language Model for Multimodal Understanding" by DeepMind.
Compared with previous open-sourced attempts ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo) and [IDEFIC](https://huggingface.co/blog/idefics)), InfiMM offers a more flexible models, allowing for a wide range of applications.
In particular, InfiMM integrates the latest LLM models into VLM domain the reveals the impact of LLMs with different scales and architectures.
Please note that InfiMM is currently in beta stage and we are continuously working on improving it.
## Model Details
- **Developed by**: Institute of Automation, Chinese Academy of Sciences and ByteDance
- **Model Type**: Visual Language Model (VLM)
- **Language**: English
- **LLMs**: [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), [LLaMA2-13B](https://ai.meta.com/llama/), [Vicuna-13B](https://huggingface.co/lmsys/vicuna-13b-v1.5)
- **Vision Model**: [EVA CLIP](https://huggingface.co/QuanSun/EVA-CLIP)
- **Language(s) (NLP):** en
- **License:** see [License section](#license)
<!---
- **Parent Models:** [QuanSun/EVA-CLIP](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA02_CLIP_L_336_psz14_s6B.pt) and [HuggingFaceH4/zephyr-7b--beta ta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
-->
## Model Family
Our model consists of several different model. Please see the details below.
| Model | LLM | Vision Encoder | IFT |
| ---------------------- | -------------- | -------------- | --- |
| InfiMM-Zephyr | Zehpyr-7B-beta | ViT-L-336 | No |
| InfiMM-Llama-13B | Llama2-13B | ViT-G-224 | No |
| InfiMM-Vicuna-13B | Vicuna-13B | ViT-E-224 | No |
| InfiMM-Zephyr-Chat | Zehpyr-7B-beta | ViT-L-336 | Yes |
| InfiMM-Llama-13B-Chat | Llama2-13B | ViT-G-224 | Yes |
| InfiMM-Vicuna-13B-Chat | Vicuna-13B | ViT-E-224 | Yes |
<!-- InfiMM-Zephyr-Chat is an light-weighted, open-source re-production of Flamingo-style Multimodal large language models with chat capability that takes sequences of interleaved images and texts as inputs and generates text outputs, with only 9B parameters.
-->
## Demo
Will be released soon.
Our model adopts the Flamingo architecture, leveraging EVA CLIP as the visual encoder and employing LLaMA2, Vicuna, and Zephyr as language models. The visual and language modalities are connected through a Cross Attention module.
## Quickstart
Use the code below to get started with the base model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
processor = AutoProcessor.from_pretrained("Infi-MM/infimm-zephyr", trust_remote_code=True)
prompts = [
{
"role": "user",
"content": [
{"image": "assets/infimm-logo.webp"},
"Please explain this image to me.",
],
}
]
inputs = processor(prompts)
# use bf16
model = AutoModelForCausalLM.from_pretrained(
"Infi-MM/infimm-zephyr",
local_files_only=True,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
).eval()
inputs = inputs.to(model.device)
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
generated_ids = model.generate(
**inputs,
min_generation_length=0,
max_generation_length=256,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
```
## Training Details
We employed three stages to train our model: pretraining (PT), multi-task training (MTT), and instruction finetuning (IFT). Refer to the table below for detailed configurations in each stage. Due to significant noise in the pretraining data, we aimed to enhance the model's accuracy by incorporating higher-quality data. In the multi-task training (MTT) phase, we utilized substantial training data from diverse datasets. However, as the answer in these data mainly consisted of single words or phrases, the model's conversational ability was limited. Therefore, in the third stage, we introduced a considerable amount of image-text dialogue data (llava665k) for fine-tuning the model's instructions.
### Pretraining (PT)
We follow similar training procedures used in [IDEFICS](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct/blob/main/README.md).
The model is trained on a mixture of image-text pairs and unstructured multimodal web documents. All data are from public sources. Many image URL links are expired, we are capable of only downloading partial samples. We filter low quality data, here are resulting data we used:
| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Number of Samples | Epochs |
| ---------------------------------------------------------------- | ------------------------------------- | -------------------------- | -------------------------- | ----------------- | ------ |
| [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS) | Unstructured Multimodal Web Documents | - | - | 101M | 1 |
| [MMC4](https://github.com/allenai/mmc4) | Unstructured Multimodal Web Documents | - | - | 53M | 1 |
| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | - | 115M | 115M | 1 |
| [COYO](https://github.com/kakaobrain/coyo-dataset) | Image-Text Pairs | - | 238M | 238M | 1 |
| [LAION-COCO](https://laion.ai/blog/laion-coco/) | Image-Text Pairs | - | 140M | 140M | 1 |
| [PMD\*](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | - | 20M | 20M | 1 |
\*PMD is only used in models with 13B LLMs, not the 7B Zephyr model.
During pretraining of interleaved image text sample, we apply masked cross-attention, however, we didn't strictly follow Flamingo, which alternate attention of image to its previous text or later text by change of 0.5.
We use the following hyper parameters:
| Categories | Parameters | Value |
| ------------------------ | -------------------------- | -------------------- |
| Perceiver Resampler | Number of Layers | 6 |
| | Number of Latents | 64 |
| | Number of Heads | 16 |
| | Resampler Head Dimension | 96 |
| Training | Sequence Length | 384 (13B) / 792 (7B) |
| | Effective Batch Size | 40\*128 |
| | Max Images per Sample | 6 |
| | Weight Decay | 0.1 |
| | Optimizer | Adam(0.9, 0.999) |
| | Gradient Accumulation Step | 2 |
| Learning Rate | Initial Max | 1e-4 |
| | Decay Schedule | Constant |
| | Warmup Step rate | 0.005 |
| Large-scale Optimization | Gradient Checkpointing | False |
| | Precision | bf16 |
| | ZeRO Optimization | Stage 2 |
### Multi-Task Training (MTT)
Here we use mix_cap_vqa to represent the mixed training set from COCO caption, TextCap, VizWiz Caption, VQAv2, OKVQA, VizWiz VQA, TextVQA, OCRVQA, STVQA, DocVQA, GQA and ScienceQA-image. For caption, we add prefix such as "Please describe the image." before the question. And for QA, we add "Answer the question using a single word or phrase.". Specifically, for VizWiz VQA, we use "When the provided information is insufficient, respond with 'Unanswerable'. Answer the question using a single word or phrase.". While for ScienceQA-image, we use "Answer with the option's letter from the given choices directly."
### Instruction Fine-Tuning (IFT)
For instruction fine-tuning stage, we use the recently released [LLaVA-MIX-665k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main).
We use the following hyper parameters:
| Categories | Parameters | Value |
| ------------------------ | -------------------------- | -------------------- |
| Perceiver Resampler | Number of Layers | 6 |
| | Number of Latents | 64 |
| | Number of Heads | 16 |
| | Resampler Head Dimension | 96 |
| Training | Sequence Length | 384 (13B) / 792 (7B) |
| | Effective Batch Size | 64 |
| | Max Images per Sample | 6 |
| | Weight Decay | 0.1 |
| | Optimizer | Adam(0.9, 0.999) |
| | Gradient Accumulation Step | 2 |
| Learning Rate | Initial Max | 1e-5 |
| | Decay Schedule | Constant |
| | Warmup Step rate | 0.005 |
| Large-scale Optimization | Gradient Checkpointing | False |
| | Precision | bf16 |
| | ZeRO Optimization | Stage 2 |
During IFT, similar to pretrain, we keep ViT and LLM frozen for both chat-based LLM (Vicuna and Zephyr). For Llama model, we keep LLM trainable during the IFT stage. We also apply chat-template to process the training samples.
## Evaluation
### PreTraining Evaluation
We evaluate the pretrained models on the following downstream tasks: Image Captioning and VQA. We also compare with our results with [IDEFICS](https://huggingface.co/blog/idefics).
| Model | Shots | COCO CIDEr | Flickr30K CIDEr | VQA v2 Acc | TextVQA Acc | OK-VQA Acc |
| ----------------- | ----- | ---------- | --------------- | ---------- | ----------- | ---------- |
| IDEFICS-9B | 0 | 46 | 27.3 | 50.9 | 25.9 | 38.4 |
| | 4 | 93 | 59.7 | 55.4 | 27.6 | 45.5 |
| IDEFICS-80B | 0 | 91.8 | 53.7 | 60 | 30.9 | 45.2 |
| | 4 | 110.3 | 73.7 | 64.6 | 34.4 | 52.4 |
| InfiMM-Zephyr-7B | 0 | 78.8 | 60.7 | 33.7 | 15.2 | 17.1 |
| | 4 | 108.6 | 71.9 | 59.1 | 34.3 | 50.5 |
| InfiMM-Llama2-13B | 0 | 85.4 | 54.6 | 51.6 | 24.2 | 26.4 |
| | 4 | 125.2 | 87.1 | 66.1 | 38.2 | 55.5 |
| InfiMM-Vicuna13B | 0 | 69.6 | 49.6 | 60.4 | 32.8 | 49.2 |
| | 4 | 118.1 | 81.4 | 64.2 | 38.4 | 53.7 |
### IFT Evaluation
In our analysis, we concentrate on two primary benchmarks for evaluating MLLMs: 1) Multi-choice Question Answering (QA) and 2) Open-ended Evaluation. We've observed that the evaluation metrics for tasks like Visual Question Answering (VQA) and Text-VQA are overly sensitive to exact answer matches. This approach can be misleading, particularly when models provide synonymous but technically accurate responses. Therefore, these metrics have been omitted from our comparison for a more precise assessment. The evaluation results are shown in the table below.
| Model | ScienceQA-Img | MME | MM-VET | InfiMM-Eval | MMbench | MMMU-Val | MMMU-Test |
| ------------------- | ------------- | --------------------- | ------ | ------------ | ------- | -------- | --------- |
| Otter-9B | - | 1292/306 | 24.6 | 32.2 | - | 22.69 | - |
| IDEFICS-9B-Instruct | 60.6 | -/- | - | - | - | 24.53 | - |
| InfiMM-Zephyr-7B | 71.1 | P: 1406<br>C:327 | 32.8 | 36.0 | 59.7 | 39.4 | 35.5 |
| InfiMM-Llama-13b | 73.0 | P: 1444.5<br>C: 337.6 | 39.2 | 0.4559/0.414 | 66.4 | 39.1 | 35.2 |
| InfiMM-Vicuna-13B | 74.0 | P: 1461.2<br>C: 323.5 | 36.0 | 40.0 | 66.7 | 37.6 | 34.6 |
<!--
| Model | TextVQA (no ocr) | OK-VQA | VQAv2 | ScienceQA-Img | GQA | MME | MM-VET | MMMU | InfiMM-Eval | MMbench |
| ----------------- | ---------------- | ------ | ----- | ------------- | ---- | --------------------- | ------ | ---- | ------------ | ------- |
| InfiMM-Zephyr-7B | 36.7 | 55.4 | / | 71.1 | | P: 1406<br>C:327 | 32.8 | 39.4 | 36.0 | 59.7 |
| InfiMM-Llama-13b | 44.6 | 62.3 | 78.5 | 73.0 | 61.2 | P: 1444.5<br>C: 337.6 | 39.2 | 39.1 | 0.4559/0.414 | 66.4 |
| InfiMM-Vicuna-13B | 41.7 | 58.5 | 73.0 | 74.0 | 58.5 | P: 1461.2<br>C: 323.5 | 36.0 | 37.6 | 40.0 | 66.7 |
We select checkpoint after 1 epoch instruction fine-tuning.
| Model | <nobr>ScienceQA <br>acc.</nobr> | <nobr>MME <br>P/C</nobr> | <nobr>MM-Vet</nobr> | <nobr>InfiMM-Eval</nobr> | <nobr>MMMU (val)</nobr> |
| :------------------ | ------------------------------: | -----------------------: | ------------------: | -----------------------: | ----------------------: |
| Otter-9B | - | 1292/306 | 24.6 | 22.69 | 32.2 |
| IDEFICS-9B-Instruct | 60.6 | -/- | - | 24.53 | - |
| InfiMM-Zephyr-Chat | 71.14 | 1406/327 | 33.3 | 35.97 | 39.4 |
-->
<details>
<summary>Leaderboard Details</summary>
<img src="assets/infimm-zephyr-mmmu-val.jpeg" style="zoom:40%;" />
<br>MMMU-Val split results<br>
<img src="assets/infimm-zephyr-mmmu-test.jpeg" style="zoom:40%;" />
<br>MMMU-Test split results<br>
</details>
## Citation
```latex
@misc{InfiMM,
title={InfiMM: Advancing Multimodal Understanding from Flamingo's Legacy through Diverse LLM Integration},
author={InfiMM Team},
url={https://huggingface.co/Infi-MM/},
year={2024}
}
```
## License
<a href="https://creativecommons.org/licenses/by-nc/4.0/deed.en">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Cc_by-nc_icon.svg/600px-Cc_by-nc_icon.svg.png" width="160">
</a>
This project is licensed under the **CC BY-NC 4.0**.
The copyright of the images belongs to the original authors.
See [LICENSE](LICENSE) for more information.
## Contact Us
Please feel free to contact us via email [[email protected]]([email protected]) if you have any questions. |