---
license: other
license_name: apple
license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# FastVLM-0.5B-Stage2

## Introduction

FastVLM-0.5B-Stage2 is a multimodal language model that can understand visual input, act as an agent, follow long videos and capture the events in them, and generate structured outputs.

This model is exported from the GitHub repository [apple/ml-fastvlm](https://github.com/apple/ml-fastvlm).

Model weights: [llava-fastvithd_0.5b_stage2.zip](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage2.zip).
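
If you want to fetch the original checkpoint archive programmatically, the following is a minimal sketch using only the Python standard library. The helper name `fetch_and_extract` and the destination directory are illustrative assumptions, not part of the FastVLM tooling. Only the URL comes from the link above.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL of the weights archive, as linked above.
WEIGHTS_URL = "https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_0.5b_stage2.zip"


def fetch_and_extract(url: str, dest_dir: str) -> Path:
    """Download the zip archive at `url` and unpack it into `dest_dir`.

    Hypothetical helper for illustration; returns the destination directory.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last path component of the URL.
    archive = dest / url.rsplit("/", 1)[-1]

    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    # Unpack next to the archive.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest


# Example call (downloads the full archive, so it is left commented out):
# fetch_and_extract(WEIGHTS_URL, "checkpoints")
```

The archive is kept on disk after extraction, so delete it manually if space matters.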

### Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Local directory (or Hub ID) of the exported model.
model_id = 'FastVLM-0.5B-Stage2'

# trust_remote_code=True allows loading custom model code from the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto', trust_remote_code=True)
```

### Export to MNN

```bash
git clone https://github.com/alibaba/MNN
cd MNN/transformers/llm/export
python llmexport.py --path /path/to/FastVLM-0.5B-Stage2 --export mnn
```
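
To script the export, for example to convert several checkpoints in a row, the same command can be driven from Python. This is a minimal sketch: the helper names `build_export_command` and `run_export` are hypothetical, and it assumes MNN has already been cloned into a directory named `MNN`. It passes only the `--path` and `--export mnn` flags shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_export_command(model_dir: str) -> list[str]:
    """Return the argv for the llmexport.py call shown above.

    The model path is made absolute because the script is run from
    a different working directory.
    """
    return [
        sys.executable,
        "llmexport.py",
        "--path", str(Path(model_dir).resolve()),
        "--export", "mnn",
    ]


def run_export(model_dir: str, mnn_repo: str = "MNN") -> None:
    """Run the export from inside an existing MNN checkout (assumed at `mnn_repo`)."""
    export_dir = Path(mnn_repo) / "transformers" / "llm" / "export"
    # check=True raises CalledProcessError if the export script fails.
    subprocess.run(build_export_command(model_dir), cwd=export_dir, check=True)


# Example call (requires a cloned MNN repository and a local model directory):
# run_export("/path/to/FastVLM-0.5B-Stage2")
```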

## Citation

If you find our work helpful, please cite it.

```
@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025},
}
```