|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- lmms-lab/LLaVA-OneVision-Data |
|
|
- BAAI/Infinity-MM |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
base_model: |
|
|
- google/siglip2-so400m-patch16-512 |
|
|
- Qwen/Qwen2-1.5B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# FlashVL-2B-Static-GRPO |
|
|
[\[📜 FlashVL\]](https://www.arxiv.org/abs/2505.09498) |
|
|
|
|
|
 |
|
|
|
|
|
## Introduction |
|
|
|
|
|
We are excited to introduce **FlashVL** (Flash-VL 2B), a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.
|
|
|
|
|
|
|
|
### Environment Setup |
|
|
|
|
|
```bash |
|
|
pip install torch==2.1.2 |
|
|
pip install transformers==4.50.0.dev0  # development build; if pip cannot resolve this version, install transformers from source
|
|
``` |
|
|
|
|
|
|
|
|
### How to use it? |
|
|
|
|
|
```python
import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

model_path = "Flash-VL/FlashVL-2B-Static-GRPO"

# Load the model (including its custom remote code), the tokenizer, and the SigLIP image processor
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='cuda')
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = SiglipProcessor.from_pretrained(model_path).image_processor

# Download the example image
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/3FF4.png"
response = requests.get(image_url)
image_data = BytesIO(response.content)
pil_image = Image.open(image_data).convert('RGB')

# Chinese prompt: "What vegetable is in the first row, second column, and how much does one jin (500 g) cost?"
messages = [{'role': 'user', 'content': "说说图中第一行第二列是什么蔬菜,买一斤多少钱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)
# 图片中第一行第二列的蔬菜是**荷兰豆**,买一斤的价格是**¥16.8**。
# ("The vegetable in the first row, second column is snow peas; one jin costs ¥16.8.")
```
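
The same `chat` interface also works with locally stored images. The snippet below is a minimal sketch that assumes the model, tokenizer, and image processor have been loaded as above; the file path and prompt are placeholders.

```python
# Minimal sketch: query a local image with an English prompt.
# Assumes `model` is loaded as in the example above; "./menu.jpg" is a placeholder path.
from PIL import Image

pil_image = Image.open("./menu.jpg").convert('RGB')
messages = [{'role': 'user', 'content': "List the items shown in this image and their prices."}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)
```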
|
|
|
|
|
### Evaluation |
|
|
|
|
|
| Method/Model | Average | DynaMath | MathVision | MathVerse | MMMU Pro | WeMath |
| :---------------------: | :-----------: | :------: | :--------: | :-------: | :------: | :----: |
| Flash-VL-2B<sub>s</sub> | 23.80 | 23.19 | 26.72 | 16.84 | 16.24 | 36.03 |
| InternVL3-2B | 27.03 | 32.55 | 26.49 | 17.00 | 22.56 | 36.55 |
| + SFT | 26.08 (+2.28) | 28.28 | 31.06 | 16.97 | 15.95 | 38.16 |
| + RL | 27.23 (+3.43) | 26.94 | 27.94 | 17.73 | 16.99 | 46.55 |
| FlashVL-2B-Static-GRPO | 29.05 (+5.25) | 30.61 | 32.48 | 18.45 | 16.53 | 47.18 |
|
|
|
|
|
Note: the "+ SFT" and "+ RL" rows apply each training stage separately on top of Flash-VL-2B<sub>s</sub>, while FlashVL-2B-Static-GRPO applies both SFT and RL; gains in parentheses are measured against the Flash-VL-2B<sub>s</sub> baseline.
|
|
|
|
|
|
|
|
We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate FlashVL-2B-Static-GRPO.
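
For reference, a typical evaluation run follows the sketch below. The dataset names and the model identifier are illustrative placeholders; use the identifiers actually registered in VLMEvalKit.

```bash
# Illustrative sketch: dataset and model names must match VLMEvalKit's registered identifiers.
git clone https://github.com/open-compass/VLMEvalKit && cd VLMEvalKit
pip install -e .
python run.py --data MathVision MathVerse_MINI DynaMath WeMath MMMU_Pro \
    --model FlashVL-2B-Static-GRPO --verbose
```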
|
|
|
|
|
|
|
|
|
|
|
## Citation |
|
|
If you find this project useful in your research, please consider citing: |
|
|
|
|
|
```BibTeX |
|
|
@misc{zhang2025flashvl2boptimizingvisionlanguage, |
|
|
title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput}, |
|
|
author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma}, |
|
|
year={2025}, |
|
|
eprint={2505.09498}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2505.09498}, |
|
|
} |
|
|
``` |