	---
	license: apple-amlr
	library_name: ml-fastvlm
	---
	# FastVLM: Efficient Vision Encoding for Vision Language Models

	FastVLM was introduced in
	[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303) (CVPR 2025).

	[//]: # (![FastViTHD Performance](acc_vs_latency_qwen-2.png))
	<p align="center">
	<img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/>
	</p>

	### Highlights
	* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
	* Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
	* Our larger variants, which use the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.


	### Evaluations
	| Benchmark     | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
	|:--------------|:------------:|:------------:|:----------:|
	| Ai2D          | 68.0         | 77.4         | 83.6       |
	| ScienceQA     | 85.2         | 94.4         | 96.7       |
	| MMMU          | 33.9         | 37.8         | 45.4       |
	| VQAv2         | 76.3         | 79.1         | 80.8       |
	| ChartQA       | 76.0         | 80.1         | 85.0       |
	| TextVQA       | 64.5         | 70.4         | 74.9       |
	| InfoVQA       | 46.4         | 59.7         | 75.8       |
	| DocVQA        | 82.5         | 88.3         | 93.2       |
	| OCRBench      | 63.9         | 70.2         | 73.1       |
	| RealWorldQA   | 56.1         | 61.2         | 67.2       |
	| SeedBench-Img | 71.0         | 74.2         | 75.4       |
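
To compare the variants at a glance, the scores in the Evaluations table can be loaded into a small Python structure. The sketch below is illustrative only: the helper names `mean_scores` and `largest_gain` are hypothetical, and the unweighted mean across benchmarks is a convenience summary, not a metric reported by the authors.

```python
# Scores copied from the Evaluations table, ordered as in VARIANTS.
VARIANTS = ("FastVLM-0.5B", "FastVLM-1.5B", "FastVLM-7B")

SCORES = {
    "Ai2D": (68.0, 77.4, 83.6),
    "ScienceQA": (85.2, 94.4, 96.7),
    "MMMU": (33.9, 37.8, 45.4),
    "VQAv2": (76.3, 79.1, 80.8),
    "ChartQA": (76.0, 80.1, 85.0),
    "TextVQA": (64.5, 70.4, 74.9),
    "InfoVQA": (46.4, 59.7, 75.8),
    "DocVQA": (82.5, 88.3, 93.2),
    "OCRBench": (63.9, 70.2, 73.1),
    "RealWorldQA": (56.1, 61.2, 67.2),
    "SeedBench-Img": (71.0, 74.2, 75.4),
}


def mean_scores(scores):
    """Return the unweighted mean score per variant (illustrative summary only)."""
    count = len(scores)
    return {
        variant: sum(row[i] for row in scores.values()) / count
        for i, variant in enumerate(VARIANTS)
    }


def largest_gain(scores, small=0, large=2):
    """Return the benchmark with the biggest score gap between two variants."""
    return max(scores, key=lambda name: scores[name][large] - scores[name][small])


for variant, mean in mean_scores(SCORES).items():
    print(f"{variant}: {mean:.1f}")
print("Largest 0.5B -> 7B gain:", largest_gain(SCORES))
```

The `small` and `large` arguments index into `VARIANTS`, so `largest_gain(SCORES, 0, 1)` compares the 0.5B and 1.5B variants instead.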


	### Usage Example
	The model has been exported to run with MLX. Follow the instructions in the official repository to use it in an iOS or macOS app.
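
Before following those instructions, the exported files need to be available locally. One possible way to fetch them is with the `huggingface_hub` Python package, as in the minimal sketch below. It rests on assumptions that this README does not state: the Hub repo id `apple/FastVLM-0.5B-fp16`, the local directory name, and the helper name `fetch_model` are all illustrative. The official repository may describe a different download path.

```python
from pathlib import Path

# Assumption: this repo id is inferred, not stated in the README. Adjust as needed.
REPO_ID = "apple/FastVLM-0.5B-fp16"


def fetch_model(local_dir="FastVLM-0.5B-fp16"):
    """Download the model files from the Hub and return the local path.

    Hypothetical helper; requires `pip install huggingface_hub` and network access.
    """
    # Imported here so the module can be loaded without the dependency installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling `fetch_model()` returns the directory containing the downloaded files, which can then be used as described in the official repository.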


	## Citation
	If you found this model useful, please cite the following paper:
	```
	@InProceedings{fastvlm2025,
	author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
	title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
	booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
	month = {June},
	year = {2025},
	}
	```