---
license: apple-amlr
license_name: apple-ascl
license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE_MODEL
library_name: ml-fastvlm
tags:
- transformers
---

# FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in
**[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303)** (CVPR 2025).

<p align="center">
<img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/>
</p>

### Highlights

* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
* Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
* Our larger variants, using the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder and achieving 7.9x faster TTFT.

### Evaluations

| Benchmark     | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D          | 68.0         | 77.4         | 83.6       |
| ScienceQA     | 85.2         | 94.4         | 96.7       |
| MMMU          | 33.9         | 37.8         | 45.4       |
| VQAv2         | 76.3         | 79.1         | 80.8       |
| ChartQA       | 76.0         | 80.1         | 85.0       |
| TextVQA       | 64.5         | 70.4         | 74.9       |
| InfoVQA       | 46.4         | 59.7         | 75.8       |
| DocVQA        | 82.5         | 88.3         | 93.2       |
| OCRBench      | 63.9         | 70.2         | 73.1       |
| RealWorldQA   | 56.1         | 61.2         | 67.2       |
| SeedBench-Img | 71.0         | 74.2         | 75.4       |

### Usage Example

To run inference with the PyTorch checkpoint, follow the instructions in the official repo.

Download the model:

```bash
huggingface-cli download apple/FastVLM-0.5B
```

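If you prefer to stay in Python rather than use the CLI, the checkpoint can also be fetched with `huggingface_hub.snapshot_download`; this is a minimal sketch, and the returned path is simply the local cache directory assigned by the Hub:

```python
from huggingface_hub import snapshot_download

# Download the repo to the local Hub cache and return its path
model_dir = snapshot_download("apple/FastVLM-0.5B")
print(model_dir)  # can be passed as --model-path below
```
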

Run inference using `predict.py` from the official repo:

```bash
python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
```

### Run inference with Transformers (Remote Code)

To run inference with Transformers, load the model with `trust_remote_code=True` and use the following snippet:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

MID = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200  # what the model code looks for

# Load
tok = AutoTokenizer.from_pretrained(MID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MID,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

# Build chat -> render to string (not tokens) so we can place <image> exactly
messages = [
    {"role": "user", "content": "<image>\nDescribe this image in detail."}
]
rendered = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
pre, post = rendered.split("<image>", 1)

# Tokenize the text *around* the image token (no extra specials!)
pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids

# Splice in the IMAGE token id (-200) at the placeholder position
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
attention_mask = torch.ones_like(input_ids, device=model.device)

# Preprocess the image via the model's own processor
img = Image.open("test-2.jpg").convert("RGB")
px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)

# Generate
with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        images=px,
        max_new_tokens=128,
    )

print(tok.decode(out[0], skip_special_tokens=True))
```
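
For interactive use you can stream tokens as they are generated. The following is a small sketch that reuses the objects from the snippet above and assumes the model's remote-code `generate` forwards the standard `streamer` argument from Transformers:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are produced
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        images=px,
        max_new_tokens=128,
        streamer=streamer,
    )
```
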
## Citation

If you find this model useful, please cite the following paper:

```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```