metadata
license: apple-amlr
library_name: ml-fastvlm
FastVLM: Efficient Vision Encoding for Vision Language Models
FastVLM was introduced in the paper "FastVLM: Efficient Vision Encoding for Vision Language Models" (CVPR 2025).
Highlights
- We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
- Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
- Our larger variants, which use the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder and achieving 7.9x faster TTFT.
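
The TTFT figures above are ratios of two latencies. As a rough illustration of how such a number can be measured, the sketch below times the gap between starting a streaming generation call and receiving its first token. It assumes TTFT covers everything up to the first output token (typically image encoding plus LLM prefill). `measure_ttft`, `speedup`, and `fake_vlm_stream` are hypothetical names used only for illustration and are not part of any FastVLM or MLX API.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token).

    `stream` is any lazy iterable of tokens. For a generator, the expensive
    work (image encoding, prefill) runs on the first `next()` call, so it is
    included in the measured interval.
    """
    start = clock()
    try:
        first = next(iter(stream))
    except StopIteration:
        raise ValueError("stream produced no tokens") from None
    return clock() - start, first


def speedup(baseline_ttft: float, candidate_ttft: float) -> float:
    """'Nx faster TTFT' read as baseline latency divided by candidate latency."""
    return baseline_ttft / candidate_ttft


def fake_vlm_stream(encode_s: float = 0.02, prefill_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a real model (assumption): sleeps instead of computing."""
    time.sleep(encode_s)   # pretend vision encoding
    time.sleep(prefill_s)  # pretend LLM prefill
    yield "A"
    yield " cat"


if __name__ == "__main__":
    ttft, token = measure_ttft(fake_vlm_stream())
    print(f"first token {token!r} after {ttft * 1000:.1f} ms")
```

To time a real model, replace `fake_vlm_stream()` with whatever streaming generator your runtime exposes, and average over several runs after a warm-up pass.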
Evaluations
Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
---|---|---|---|
Ai2D | 68.0 | 77.4 | 83.6 |
ScienceQA | 85.2 | 94.4 | 96.7 |
MMMU | 33.9 | 37.8 | 45.4 |
VQAv2 | 76.3 | 79.1 | 80.8 |
ChartQA | 76.0 | 80.1 | 85.0 |
TextVQA | 64.5 | 70.4 | 74.9 |
InfoVQA | 46.4 | 59.7 | 75.8 |
DocVQA | 82.5 | 88.3 | 93.2 |
OCRBench | 63.9 | 70.2 | 73.1 |
RealWorldQA | 56.1 | 61.2 | 67.2 |
SeedBench-Img | 71.0 | 74.2 | 75.4 |
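
To see where model size matters most, the scores in the table can be compared directly. The following is a small sketch that copies the table values into a dictionary and computes the gain from FastVLM-0.5B to FastVLM-7B for each benchmark. The names `SCORES` and `gains` are illustrative, and the gain is a simple difference of reported scores, not a metric defined by the authors.

```python
# Scores copied from the table above: (FastVLM-0.5B, FastVLM-1.5B, FastVLM-7B).
SCORES = {
    "Ai2D": (68.0, 77.4, 83.6),
    "ScienceQA": (85.2, 94.4, 96.7),
    "MMMU": (33.9, 37.8, 45.4),
    "VQAv2": (76.3, 79.1, 80.8),
    "ChartQA": (76.0, 80.1, 85.0),
    "TextVQA": (64.5, 70.4, 74.9),
    "InfoVQA": (46.4, 59.7, 75.8),
    "DocVQA": (82.5, 88.3, 93.2),
    "OCRBench": (63.9, 70.2, 73.1),
    "RealWorldQA": (56.1, 61.2, 67.2),
    "SeedBench-Img": (71.0, 74.2, 75.4),
}


def gains(scores):
    """Return {benchmark: 7B score minus 0.5B score}, rounded to one decimal."""
    return {name: round(vals[-1] - vals[0], 1) for name, vals in scores.items()}


GAINS = gains(SCORES)
LARGEST = max(GAINS, key=GAINS.get)   # benchmark that benefits most from scale
SMALLEST = min(GAINS, key=GAINS.get)  # benchmark that benefits least

if __name__ == "__main__":
    # Print benchmarks from largest to smallest gain.
    for name, gain in sorted(GAINS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<14} +{gain:.1f}")
```

On these numbers, InfoVQA shows the largest gain from the smallest to the largest variant (29.4) and SeedBench-Img the smallest (4.4). Every benchmark improves monotonically with model size.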
Usage Example
The model has been exported to run with MLX. Follow the instructions in the official repository to use it in an iOS or macOS app.
Citation
If you found this model useful, please cite the following paper:
@InProceedings{fastvlm2025,
author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
}