---
license: apache-2.0
pipeline_tag: image-text-to-text
---
**<center><span style="font-size:2em;">TinyLLaVA-Video</span></center>**
[![arXiv](https://img.shields.io/badge/Arxiv-2501.15513-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.15513) [![Github](https://img.shields.io/badge/Github-Github-blue.svg)](https://github.com/ZhangXJ199/TinyLLaVA-Video)
Here, we introduce TinyLLaVA-Video-Phi2-16-512. For the LLM and vision tower, we use [Phi-2](https://huggingface.co/microsoft/phi-2) and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384), respectively. The model samples 16 frames from each video and represents the video sequence with 512 tokens.
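
### Usage

Below is a minimal, unverified sketch of how the checkpoint might be loaded with `transformers`. It assumes the checkpoint ships custom remote code exposing a TinyLLaVA-style `chat` interface and that a `video` argument accepts a local file path; both are assumptions, so please refer to the inference scripts in the [GitHub repository](https://github.com/ZhangXJ199/TinyLLaVA-Video) for the authoritative usage.

```python
# Minimal usage sketch. Assumptions: the checkpoint's remote code provides a
# `chat`-style method (as in the TinyLLaVA family) and accepts a `video` path;
# the exact API may differ -- see the GitHub repo for the reference scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Zhang199/TinyLLaVA-Video-Phi2-16-512"

# trust_remote_code is required because the architecture is not built into transformers.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

prompt = "Describe what happens in this video."
video_path = "example.mp4"  # hypothetical local video file

# The model samples 16 frames from the video and encodes them into 512 tokens
# before feeding them to the Phi-2 language model.
output = model.chat(prompt=prompt, video=video_path, tokenizer=tokenizer)  # assumed interface
print(output)
```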
### Results
| Model (HF Path) | #Frames / #Query Tokens | Video-MME | MVBench | LongVideoBench | MLVU |
| :----------------------------------------: | :---------------------: | :-------: | :-----: | :------------: | :--: |
| [Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512) | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| [Zhang199/TinyLLaVA-Video-Phi2-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Phi2-16-512) | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |