---
license: apache-2.0
pipeline_tag: image-text-to-text
---
**<center><span style="font-size:2em;">TinyLLaVA-Video</span></center>**
[![arXiv](https://img.shields.io/badge/Arxiv-2501.15513-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.15513) [![Github](https://img.shields.io/badge/Github-Github-blue.svg)](https://github.com/ZhangXJ199/TinyLLaVA-Video)
Here, we introduce TinyLLaVA-Video-Phi2-16-512. For the LLM and vision tower, we use [Phi-2](https://huggingface.co/microsoft/phi-2) and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384), respectively. The model samples 16 frames from each video and represents the video sequence with 512 tokens.
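
### Usage

Below is a minimal, unverified sketch of how the checkpoint might be loaded with `transformers`. It assumes the checkpoint ships custom remote code exposing a TinyLLaVA-style `chat` interface and that a `video` argument accepts a local file path; both are assumptions, so please refer to the inference scripts in the [GitHub repository](https://github.com/ZhangXJ199/TinyLLaVA-Video) for the authoritative usage.

```python
# Minimal usage sketch. Assumptions: the checkpoint's remote code provides a
# `chat`-style method (as in the TinyLLaVA family) and accepts a `video` path;
# the exact API may differ -- see the GitHub repo for the reference scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Zhang199/TinyLLaVA-Video-Phi2-16-512"

# trust_remote_code is required because the architecture is not built into transformers.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

prompt = "Describe what happens in this video."
video_path = "example.mp4"  # hypothetical local video file

# The model samples 16 frames from the video and encodes them into 512 tokens
# before feeding them to the Phi-2 language model.
output = model.chat(prompt=prompt, video=video_path, tokenizer=tokenizer)  # assumed interface
print(output)
```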
### Results
| Model (HF Path) | #Frames / #Query Tokens | Video-MME | MVBench | LongVideoBench | MLVU |
| :----------------------------------------: | :---------------------: | :-------: | :-----: | :------------: | :--: |
| [Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512) | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| [Zhang199/TinyLLaVA-Video-Phi2-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Phi2-16-512) | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |