We recently merged Video-LLaVA into @huggingface transformers! 🤗
🎞️ What makes this model different? Keep reading ⇊
[Image: video]
[Demo](https://t.co/MVP14uEj9e) | [Model](https://t.co/oqSCMUqwJo)
See below how to initialize the model and processor and run inference ⬇️
[Image: image_1]
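Here is a minimal sketch of that flow in text form. It assumes the `LanguageBind/Video-LLaVA-7B-hf` checkpoint and the `VideoLlavaProcessor` / `VideoLlavaForConditionalGeneration` classes from transformers. The helper names (`sample_frame_indices`, `build_prompt`, `answer_about_video`), the 8-frame default, and the prompt template are assumptions for illustration, so check the documentation linked in the resources for the exact format.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick evenly spaced frame indices from a clip (8 frames is an assumed default)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    num_frames = min(num_frames, total_frames)
    return [(i * total_frames) // num_frames for i in range(num_frames)]


def build_prompt(question: str) -> str:
    """Wrap a question in a Vicuna-style template with a <video> placeholder (assumed format)."""
    return f"USER: <video>\n{question} ASSISTANT:"


def answer_about_video(frames, question: str,
                       model_id: str = "LanguageBind/Video-LLaVA-7B-hf",
                       max_new_tokens: int = 80) -> str:
    """Run one question over a clip.

    `frames` is expected to be a NumPy array of shape (num_frames, height, width, 3),
    e.g. the frames selected with `sample_frame_indices` after decoding the video.
    """
    # Heavy dependencies are imported lazily so the helpers above work without them.
    import torch
    from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = VideoLlavaProcessor.from_pretrained(model_id)
    model = VideoLlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype)
    model.to(device)

    # The processor tokenizes the prompt and preprocesses the frames in one call.
    inputs = processor(text=build_prompt(question), videos=frames, return_tensors="pt")
    inputs = inputs.to(device, dtype)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The decoded string contains the prompt followed by the model's answer.
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Decoding the video into frames is left out on purpose. Any decoder that yields RGB frames as a NumPy array should work with this sketch.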
Other models that take image and video input either project the two separately or downsample the video and project selected frames. Video-LLaVA instead converts images and videos to a unified representation and projects them with a shared projection layer.
[Image: image_2]
It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP. These encoders map the modalities to a unified representation before passing them to the projection layer.
[Image: image_3]
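To make the shared-projection idea concrete, here is a toy sketch in plain Python. It is not the actual Video-LLaVA code: the encoders are stand-ins that return constant vectors, and `SharedProjector`, `encode_image`, `encode_video`, and all sizes are hypothetical. The point is only that image tokens and video tokens live in the same feature space, so a single projector can handle both.

```python
import random
from typing import List

Token = List[float]  # one visual token (a feature vector)


class SharedProjector:
    """Toy linear layer mapping visual tokens to the language model's embedding size."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]

    def __call__(self, tokens: List[Token]) -> List[Token]:
        # Apply the same weights to every token, whatever modality it came from.
        return [[sum(w * x for w, x in zip(row, tok)) for row in self.weight]
                for tok in tokens]


def encode_image(num_tokens: int, dim: int) -> List[Token]:
    """Stand-in for the image encoder: `num_tokens` tokens of size `dim`."""
    return [[1.0] * dim for _ in range(num_tokens)]


def encode_video(num_frames: int, num_tokens: int, dim: int) -> List[Token]:
    """Stand-in for the video encoder: per-frame tokens flattened into one sequence."""
    return [[0.5] * dim for _ in range(num_frames * num_tokens)]


# One projector instance is reused for both modalities.
projector = SharedProjector(in_dim=4, out_dim=6)
image_tokens = projector(encode_image(num_tokens=3, dim=4))
video_tokens = projector(encode_video(num_frames=2, num_tokens=3, dim=4))

print(len(image_tokens), len(video_tokens), len(image_tokens[0]), len(video_tokens[0]))
# prints: 3 6 6 6
```

In a real implementation the projector would be a small neural network module and the encoders would be pretrained vision towers. The sequence lengths differ between image and video, but the token dimension and the projector are the same.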
I feel like one of the coolest features of this model is its joint understanding of images and videos, which many models have only introduced recently. It's a relatively older model, but it was ahead of its time and works very well!
[Image: image_4]
> [!TIP]
Resources:
[Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://arxiv.org/abs/2311.10122)
by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023)
[GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA)
[Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/video_llava)
> [!NOTE]
[Original tweet](https://x.com/mervenoyann/status/1816427325073842539) (July 25, 2024) |