|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- lmms-lab/LLaVA-OneVision-Data |
|
|
- BAAI/Infinity-MM |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
base_model: |
|
|
- google/siglip2-so400m-patch16-512 |
|
|
- Qwen/Qwen2-1.5B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# FlashVL-2B-Static-GRPO |
|
|
[\[📜 FlashVL\]](https://www.arxiv.org/abs/2505.09498) |
|
|
|
|
|
 |
|
|
|
|
|
## Introduction |
|
|
|
|
|
We are excited to introduce **FlashVL** (Flash-VL 2B), a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.
|
|
|
|
|
|
|
|
### Environment Setup |
|
|
|
|
|
```bash |
|
|
pip install torch==2.1.2 |
|
|
pip install transformers==4.50.0.dev0  # development build; if pip cannot resolve this version, install transformers from source
|
|
``` |
|
|
|
|
|
|
|
|
### How to use it? |
|
|
|
|
|
```python
import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

model_path = "Flash-VL/FlashVL-2B-Static-GRPO"

# Load the model (including its custom remote code), the tokenizer, and the SigLIP image processor
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='cuda')
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = SiglipProcessor.from_pretrained(model_path).image_processor

# Download the example image
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/3FF4.png"
response = requests.get(image_url)
image_data = BytesIO(response.content)
pil_image = Image.open(image_data).convert('RGB')

# Chinese prompt: "What vegetable is in the first row, second column, and how much does one jin (500 g) cost?"
messages = [{'role': 'user', 'content': "说说图中第一行第二列是什么蔬菜,买一斤多少钱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)
# 图片中第一行第二列的蔬菜是**荷兰豆**,买一斤的价格是**¥16.8**。
# ("The vegetable in the first row, second column is snow peas; one jin costs ¥16.8.")
```
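
The same `chat` interface also works with locally stored images. The snippet below is a minimal sketch that assumes the model, tokenizer, and image processor have been loaded as above; the file path and prompt are placeholders.

```python
# Minimal sketch: query a local image with an English prompt.
# Assumes `model` is loaded as in the example above; "./menu.jpg" is a placeholder path.
from PIL import Image

pil_image = Image.open("./menu.jpg").convert('RGB')
messages = [{'role': 'user', 'content': "List the items shown in this image and their prices."}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)
```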
|
|
|
|
|
### Evaluation |
|
|
|
|
|
| Method/Model | Average | DynaMath | MathVision | MathVerse | MMMU Pro | WeMath |
| :---------------------: | :-----------: | :------: | :--------: | :-------: | :------: | :----: |
| Flash-VL-2B<sub>s</sub> | 23.80 | 23.19 | 26.72 | 16.84 | 16.24 | 36.03 |
| InternVL3-2B | 27.03 | 32.55 | 26.49 | 17.00 | 22.56 | 36.55 |
| + SFT | 26.08 (+2.28) | 28.28 | 31.06 | 16.97 | 15.95 | 38.16 |
| + RL | 27.23 (+3.43) | 26.94 | 27.94 | 17.73 | 16.99 | 46.55 |
| FlashVL-2B-Static-GRPO | 29.05 (+5.25) | 30.61 | 32.48 | 18.45 | 16.53 | 47.18 |
|
|
|
|
|
Note: the "+ SFT" and "+ RL" rows apply each training stage separately on top of Flash-VL-2B<sub>s</sub>, while FlashVL-2B-Static-GRPO applies both SFT and RL; gains in parentheses are measured against the Flash-VL-2B<sub>s</sub> baseline.
|
|
|
|
|
|
|
|
We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate FlashVL-2B-Static-GRPO.
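
For reference, a typical evaluation run follows the sketch below. The dataset names and the model identifier are illustrative placeholders; use the identifiers actually registered in VLMEvalKit.

```bash
# Illustrative sketch: dataset and model names must match VLMEvalKit's registered identifiers.
git clone https://github.com/open-compass/VLMEvalKit && cd VLMEvalKit
pip install -e .
python run.py --data MathVision MathVerse_MINI DynaMath WeMath MMMU_Pro \
    --model FlashVL-2B-Static-GRPO --verbose
```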
|
|
|
|
|
|
|
|
|
|
|
## Citation |
|
|
If you find this project useful in your research, please consider citing: |
|
|
|
|
|
```BibTeX |
|
|
@misc{zhang2025flashvl2boptimizingvisionlanguage, |
|
|
title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput}, |
|
|
author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma}, |
|
|
year={2025}, |
|
|
eprint={2505.09498}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2505.09498}, |
|
|
} |
|
|
``` |