---
license: apache-2.0
pipeline_tag: image-text-to-text
---

**<center><span style="font-size:2em;">TinyLLaVA-Video</span></center>**

[arXiv](https://arxiv.org/abs/2501.15513) · [GitHub](https://github.com/ZhangXJ199/TinyLLaVA-Video)
Here we introduce TinyLLaVA-Video-Phi2-16-512. For the LLM and the vision tower, we choose [Phi-2](https://huggingface.co/microsoft/phi-2) and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384), respectively. The model samples 16 frames from each video and represents the video sequence with 512 tokens.
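
As a rough, unofficial illustration of the input format described above, the sketch below uniformly samples 16 frames from a local video with OpenCV and loads the checkpoint through `transformers`. The uniform-sampling strategy, the `trust_remote_code` loading path, and the file name `example.mp4` are assumptions made for illustration, not the documented API of this checkpoint; please refer to the GitHub repository above for the official inference scripts.

```python
# Unofficial sketch: sample 16 frames and load the checkpoint.
# The sampling strategy and the trust_remote_code loading path are
# assumptions; see the GitHub repository for the official pipeline.
import cv2
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

NUM_FRAMES = 16  # this model encodes each video as 16 frames -> 512 tokens


def sample_frames(video_path, num_frames=NUM_FRAMES):
    """Uniformly sample `num_frames` RGB frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


frames = sample_frames("example.mp4")  # hypothetical local video file

# Assumption: the checkpoint ships custom modeling code loadable via
# trust_remote_code; if it does not, use the loading utilities from the
# TinyLLaVA-Video GitHub repository instead.
hf_path = "Zhang199/TinyLLaVA-Video-Phi2-16-512"
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False, trust_remote_code=True)
```

The returned `frames` list would then be preprocessed by the SigLIP vision tower and compressed into 512 video tokens before being fed to Phi-2; the exact preprocessing and prompting logic live in the repository's inference code.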

### Result

| Model (HF Path) | #Frames / #Video Tokens | Video-MME | MVBench | LongVideoBench | MLVU |
| :-------------: | ----------------------- | --------- | ------- | -------------- | ---- |
| [Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512) | 16 / 512 | 44.7 | 42.5 | 37.6 | 48.1 |
| [Zhang199/TinyLLaVA-Video-Phi2-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Phi2-16-512) | 16 / 512 | 42.7 | 42.0 | 42.2 | 46.5 |