
	We recently merged Video-LLaVA into @huggingface transformers! 🤗
	🎞️ What makes this model different? Keep reading ⇊

	![video](video_1.mp4)

	[Demo](https://t.co/MVP14uEj9e) \| [Model](https://t.co/oqSCMUqwJo)
	See below for how to initialize the model and processor and run inference ⬇️


	![image_1](image_1.jpg)
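Here is a minimal text sketch of that flow, not a definitive recipe. It assumes the `LanguageBind/Video-LLaVA-7B-hf` checkpoint and the `USER: <video> ... ASSISTANT:` prompt format described in the transformers docs. The helper names `build_prompt`, `sample_frame_indices`, and `describe_video` are hypothetical, and the 8-frame default is my assumption.

```python
def build_prompt(question, modality="video"):
    """Wrap a question in the chat format, with a placeholder for the visual input."""
    if modality not in ("video", "image"):
        raise ValueError("modality must be 'video' or 'image'")
    return f"USER: <{modality}>\n{question} ASSISTANT:"


def sample_frame_indices(total_frames, num_frames=8):
    """Pick `num_frames` evenly spaced frame indices from a clip."""
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def describe_video(frames, question,
                   model_id="LanguageBind/Video-LLaVA-7B-hf",  # assumed checkpoint
                   max_new_tokens=60):
    """Run one question against one clip.

    `frames` is expected to be an array of shape (num_frames, height, width, 3).
    The import is kept local so the helpers above work without transformers installed.
    """
    from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

    processor = VideoLlavaProcessor.from_pretrained(model_id)
    model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

    inputs = processor(text=build_prompt(question), videos=frames, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

To use it, decode a clip with the video reader of your choice, keep the frames at `sample_frame_indices(total_frames)`, stack them into one array, and pass that to `describe_video`.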

	Compared to other models that take image and video input and either project them separately or downsample the video and project selected frames, Video-LLaVA converts images and videos to a unified representation and projects them using a shared projection layer.

	![image_2](image_2.jpg)
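To make the shared-projection idea concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the dimensions are tiny, the weights are random, and the "encoder outputs" are random stand-ins. The real model uses learned encoders and a learned projector. The only point is that image tokens and video tokens go through the same weights.

```python
import random

HIDDEN = 4    # toy width of the unified visual feature space (assumption)
LLM_DIM = 6   # toy width of the language model embeddings (assumption)


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as a list of lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    """Multiply each token vector by `weight` (shape HIDDEN x LLM_DIM)."""
    columns = list(zip(*weight))
    return [[sum(t * w for t, w in zip(token, col)) for col in columns]
            for token in tokens]


rng = random.Random(0)

# One set of projection weights, shared by both modalities.
shared_projection = make_matrix(HIDDEN, LLM_DIM, rng)

# Stand-ins for encoder outputs that already live in the same feature space.
image_tokens = make_matrix(3, HIDDEN, rng)      # 3 patch tokens from one image
video_tokens = make_matrix(2 * 3, HIDDEN, rng)  # 2 frames x 3 patch tokens each

image_embeds = project(image_tokens, shared_projection)
video_embeds = project(video_tokens, shared_projection)

# Both modalities now sit in the language model's embedding width
# and can be placed in one sequence.
visual_sequence = image_embeds + video_embeds
```

In the "project them separately" design, `image_tokens` and `video_tokens` would each get their own projection matrix instead of sharing `shared_projection`.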

	It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP. These encoders map the modalities to a unified representation before passing them to the projection layer.

	![image_3](image_3.jpg)

	I feel like one of the coolest features of this model is its joint understanding of images and videos, which many recent models have also introduced. It's a relatively older model, but it was ahead of its time and works very well!

	![image_4](image_4.jpg)
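One way to try the joint image and video handling is to send a mixed batch through the same model, sketched below. This is a hedged example, not the author's code: `build_mixed_prompts` and `answer_image_and_video` are hypothetical helpers, the checkpoint name is assumed, and the `<image>` / `<video>` placeholders follow the prompt format from the transformers docs.

```python
def build_mixed_prompts(image_question, video_question):
    """One prompt per sample: the first refers to an image, the second to a video."""
    return [
        f"USER: <image>\n{image_question} ASSISTANT:",
        f"USER: <video>\n{video_question} ASSISTANT:",
    ]


def answer_image_and_video(image, frames, image_question, video_question,
                           model_id="LanguageBind/Video-LLaVA-7B-hf",  # assumed checkpoint
                           max_new_tokens=60):
    """Answer one image question and one video question in a single batch.

    `image` is a single image, `frames` an array of shape (num_frames, height, width, 3).
    The import is local so `build_mixed_prompts` works without transformers installed.
    """
    from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

    processor = VideoLlavaProcessor.from_pretrained(model_id)
    # Left padding is the usual choice when generating from a batch of prompts.
    processor.tokenizer.padding_side = "left"
    model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

    inputs = processor(
        text=build_mixed_prompts(image_question, video_question),
        images=image,
        videos=frames,
        padding=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)
```

The function returns two strings, one per prompt, in the same order as the prompts.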

	> [!TIP]
	Resources:
	[Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://arxiv.org/abs/2311.10122)
	by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023)
	[GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA)
	[Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/video_llava)

	> [!NOTE]
	[Original tweet](https://x.com/mervenoyann/status/1816427325073842539) (July 25, 2024)