---

license: apple-amlr
license_name: apple-ascl
license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE_MODEL
library_name: ml-fastvlm
tags:
- transformers
---

# FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in
**[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303) (CVPR 2025)**.

[//]: # (![FastViTHD Performance](acc_vs_latency_qwen-2.png))
<p align="center">
<img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/>
</p>

### Highlights
* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.  
* Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
* Our larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.


### Evaluations
| Benchmark     | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D          |     68.0     |     77.4     |    83.6    |
| ScienceQA     |     85.2     |     94.4     |    96.7    |
| MMMU          |     33.9     |     37.8     |    45.4    |
| VQAv2         |     76.3     |     79.1     |    80.8    |
| ChartQA       |     76.0     |     80.1     |    85.0    |
| TextVQA       |     64.5     |     70.4     |    74.9    |
| InfoVQA       |     46.4     |     59.7     |    75.8    |
| DocVQA        |     82.5     |     88.3     |    93.2    |
| OCRBench      |     63.9     |     70.2     |    73.1    |
| RealWorldQA   |     56.1     |     61.2     |    67.2    |
| SeedBench-Img |     71.0     |     74.2     |    75.4    |


### Usage Example
To run inference with the PyTorch checkpoint, follow the instructions in the official repo.

Download the model:
```bash
huggingface-cli download apple/FastVLM-0.5B
```

Run inference using `predict.py` from the official repo:
```bash
python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
```

### Run inference with Transformers (Remote Code)
To run inference with Transformers, pass `trust_remote_code=True` when loading the tokenizer and the model, as in the following snippet:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

MID = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200  # what the model code looks for

# Load
tok = AutoTokenizer.from_pretrained(MID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MID,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

# Build chat -> render to string (not tokens) so we can place <image> exactly
messages = [
    {"role": "user", "content": "<image>\nDescribe this image in detail."}
]
rendered = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

pre, post = rendered.split("<image>", 1)

# Tokenize the text *around* the image token (no extra specials!)
pre_ids  = tok(pre,  return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids

# Splice in the IMAGE token id (-200) at the placeholder position
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
attention_mask = torch.ones_like(input_ids, device=model.device)

# Preprocess image via the model's own processor
img = Image.open("test-2.jpg").convert("RGB")
px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)

# Generate
with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        images=px,
        max_new_tokens=128,
    )

print(tok.decode(out[0], skip_special_tokens=True))
```
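The placeholder-splicing step in the snippet above can be looked at in isolation. The following is a minimal sketch in plain Python, not part of the official repo: `toy_tokenize` is a hypothetical stand-in for a real tokenizer (one id per character), and `splice_image_token` is an illustrative helper name. It only shows how the text on each side of `<image>` is tokenized separately and joined around the `-200` sentinel.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel id used in the snippet above


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def splice_image_token(rendered, tokenize, placeholder="<image>",
                       image_id=IMAGE_TOKEN_INDEX):
    """Tokenize the text around the first placeholder and put image_id between the halves."""
    if placeholder not in rendered:
        raise ValueError(f"prompt has no {placeholder!r} placeholder")
    # Split once, so any later occurrence of the placeholder stays in `post` as plain text.
    pre, post = rendered.split(placeholder, 1)
    return tokenize(pre) + [image_id] + tokenize(post)


ids = splice_image_token("Hi <image>\nDescribe.", toy_tokenize)
print(ids.count(IMAGE_TOKEN_INDEX), ids.index(IMAGE_TOKEN_INDEX))  # prints: 1 3
```

Because the sentinel is not a real vocabulary id, it is inserted after tokenization instead of being encoded as text, which is why the snippet above renders the chat template to a string first.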

## Citation
If you find this model useful, please cite the following paper:
```
@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025},
}
```