---
license: apache-2.0
language:
- en
base_model:
- ibm-granite/granite-3.1-2b-instruct
library_name: transformers
---
# granite-vision-3.1-2b-preview
**Model Summary:**
granite-vision-3.1-2b-preview is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on a meticulously curated instruction-following dataset, comprising diverse public datasets and synthetic datasets tailored to support a wide range of document understanding and general image tasks. It was trained by fine-tuning a Granite large language model (https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) with both image and text modalities.
_Note:_
We denote our model as Granite-Vision-3.1-2B-Preview, where 3.1 and 2B indicate the version and size of the base large language model. However, when the integrated vision encoder and projector are included, the total parameter count of the model increases to 3 billion parameters.
**Evaluations:**
We evaluated Granite Vision 3.1 alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard lmms-eval framework. The evaluation spanned multiple public benchmarks, with particular emphasis on document understanding tasks, while also including general visual question-answering benchmarks. A hedged sketch of invoking lmms-eval is shown after the table below.
| | Molmo-E 1B | InternVL2 2B | Phi3v 4B | Phi3.5v 4B | Granite Vision 3B |
|-----------|--------------|----------------|-------------|------------|------------|
| **Document benchmarks** | | | | | |
| DocVQA | 0.66 | 0.87 | 0.87 | **0.88** | **0.88** |
| ChartQA | 0.60 | 0.75 | 0.81 | 0.82 | **0.86** |
| TextVQA | 0.62 | 0.72 | 0.69 | 0.70 | **0.76** |
| AI2D | 0.63 | 0.74 | **0.79** | **0.79** | 0.78 |
| InfoVQA | 0.44 | 0.58 | 0.55 | 0.61 | **0.63** |
| OCRBench | 0.65 | **0.75** | 0.64 | 0.64 | **0.75** |
| LiveXiv VQA | 0.47 | 0.51 | **0.61** | - | **0.61** |
| LiveXiv TQA | 0.36 | 0.38 | 0.48 | - | **0.55** |
| **Other benchmarks** | | | | | |
| MMMU | 0.32 | 0.35 | 0.42 | **0.44** | 0.35 |
| VQAv2 | 0.57 | 0.75 | 0.76 | 0.77 | **0.81** |
| RealWorldQA | 0.55 | 0.34 | 0.60 | 0.58 | **0.65** |
| VizWiz VQA | 0.49 | 0.46 | 0.57 | 0.57 | **0.64** |
| OK VQA | 0.40 | 0.44 | 0.51 | 0.53 | **0.57** |
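For readers who want to reproduce numbers of this kind, the sketch below shows how the lmms-eval harness is typically invoked from the command line. This is a hedged illustration only: the model adapter name (`llava_hf`) and the task identifiers are assumptions and should be checked against the lmms-eval documentation for the correct values for this model.
```shell
# Hedged sketch: adapter name and task ids are assumptions, not a confirmed recipe for this model.
pip install lmms-eval
python -m lmms_eval \
    --model llava_hf \
    --model_args pretrained=ibm-granite/granite-vision-3.1-2b-preview \
    --tasks docvqa,chartqa,textvqa \
    --batch_size 1 \
    --output_path ./logs
```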
- **Paper:** coming soon
- **Release Date**: Jan 31st, 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
**Supported Languages:**
English
**Intended Use:**
The model is intended to be used in enterprise applications that involve processing visual and text data. In particular, the model is well-suited for a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Additionally, its capabilities extend to general image understanding, enabling it to be applied to a broader range of business applications. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.
## Generation:
Granite Vision is supported natively in `transformers` from the `main` branch. Below is a simple example of how to use the `granite-vision-3.1-2b-preview` model.
### Usage with `transformers`
First, make sure to build `transformers` from source by following the instructions [here](https://huggingface.co/docs/transformers/v4.48.2/en/installation#install-from-source):
```shell
pip install git+https://github.com/huggingface/transformers
```
Then run the code:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
# prepare image and text prompt, using the appropriate prompt template
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
conversation = [
{
"role": "user",
"content": [
{"type": "image", "url": img_path},
{"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
],
},
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(device)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
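If GPU memory is a constraint, the model can also be loaded in reduced precision. The snippet below is a minimal sketch using the standard `torch_dtype` argument of `from_pretrained`; half precision is our assumption here and may slightly change numerical outputs.
```python
import torch
from transformers import AutoModelForVision2Seq

# Hedged sketch: load the model in bfloat16 to roughly halve GPU memory usage.
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-vision-3.1-2b-preview",
    torch_dtype=torch.bfloat16,
).to("cuda")
```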
### Usage with vLLM
The model can also be loaded with `vLLM`. First make sure to install the following libraries:
```shell
pip install torch torchvision torchaudio
pip install vllm==0.6.6
```
Then run the snippet below for offline inference. A hedged sketch of online serving with vLLM's OpenAI-compatible server follows after it.
```python
from vllm import LLM, SamplingParams
from huggingface_hub import hf_hub_download
from PIL import Image
model_path = "ibm-granite/granite-vision-3.1-2b-preview"
model = LLM(
model=model_path,
limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(
temperature=0.2,
max_tokens=64,
)
# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
image = Image.open(img_path).convert("RGB")
print(image)
# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
"prompt": prompt,
"multi_modal_data": {
"image": image,
}
}
outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```
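The model can also be served through vLLM's OpenAI-compatible API server. The following is a hedged sketch rather than an officially documented recipe for this model: the launch command and the multimodal chat-completions payload follow generic vLLM and OpenAI client conventions, and the image URL is a placeholder.
```shell
# Hedged sketch: start vLLM's OpenAI-compatible server (listens on port 8000 by default).
vllm serve ibm-granite/granite-vision-3.1-2b-preview
```
```python
# Hedged sketch: query the server with the OpenAI client; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ibm-granite/granite-vision-3.1-2b-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
            ],
        }
    ],
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].message.content)
```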
**Model Architecture:**
The architecture of granite-vision-3.1-2b-preview consists of the following components:
(1) Vision encoder: SigLIP (https://huggingface.co/docs/transformers/en/model_doc/siglip).
(2) Vision-language connector: a two-layer MLP with a GELU activation function.
(3) Large language model: granite-3.1-2b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.1-2b-instruct).
We built upon LLaVA (https://llava-vl.github.io) to train our model. We use multi-layer encoder features and a denser grid resolution in AnyRes to enhance the model's ability to understand nuanced visual content, which is essential for accurately interpreting document images.
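For intuition, here is a minimal PyTorch sketch of such a two-layer MLP connector with a GELU activation, mapping vision-encoder features into the language model's embedding space. The dimensions below are illustrative assumptions, not the model's actual configuration.
```python
import torch
import torch.nn as nn

# Illustrative sketch only: dimensions are assumptions, not the actual model configuration.
VISION_DIM = 1152  # assumed SigLIP feature size
LLM_DIM = 2048     # assumed LLM hidden size


class VisionLanguageConnector(nn.Module):
    """Two-layer MLP with GELU that projects vision features into the LLM embedding space."""

    def __init__(self, vision_dim: int = VISION_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_features)


connector = VisionLanguageConnector()
dummy_features = torch.randn(1, 729, VISION_DIM)  # fake patch features for a single image
print(connector(dummy_features).shape)  # torch.Size([1, 729, 2048])
```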
**Training Data:**
Overall, our training data is largely composed of two key sources: (1) publicly available datasets, and (2) internally created synthetic data targeting specific capabilities, including document understanding tasks. A detailed attribution of datasets can be found in the technical report (coming soon).
**Infrastructure:**
We train Granite Vision using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
**Ethical Considerations and Limitations:**
The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-3.1-2b-preview is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination by copying text verbatim from the training dataset, owing to their reduced size and memorization capacity. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-3.1-2b-preview with ethical intentions and in a responsible way. We recommend using this model for document understanding tasks, and note that more general vision tasks may pose higher inherent risks of triggering biased or harmful output.
**Resources**
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources