Parameter-free LLaVA for video captioning works like magic! 🤩 Let's take a look!
data:image/s3,"s3://crabby-images/540d4/540d4c412cd0cdb532c1e10493c4548a6a2337e6" alt="image_1"
Most video captioning models downsample video frames to reduce computational complexity and memory requirements without losing too much information in the process.
PLLaVA, on the other hand, uses pooling! 🤩
How? 🧐 It takes in video frames, passes them through a ViT and then a projection layer, and applies average pooling to the output, whose shape is (# frames, width, height, text decoder input dim) 👇
data:image/s3,"s3://crabby-images/45d8c/45d8c917421fe1067e0fb26ab3d5049b823c5d99" alt="image_2"
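To make the pooling step concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name, the tiny sizes, and the requirement that each input size is an exact multiple of its target size are all assumptions made for illustration. The order of the two spatial axes does not matter for the idea.

```python
import numpy as np


def average_pool_video_features(features, target_shape):
    """Average-pool a (frames, height, width, dim) array down to target_shape.

    Simplifying assumption: each input size is an exact multiple of its
    target size, so pooling reduces to a mean over non-overlapping blocks.
    """
    t, h, w, d = features.shape
    tt, th, tw = target_shape
    if t % tt or h % th or w % tw:
        raise ValueError("each input size must be a multiple of its target size")
    # Split every pooled axis into (output index, position inside the block).
    blocks = features.reshape(tt, t // tt, th, h // th, tw, w // tw, d)
    # Average over the three "inside the block" axes; the feature dim is untouched.
    return blocks.mean(axis=(1, 3, 5))


# Illustrative sizes only: 4 frames, a 4x4 patch grid, 8-dim projected features.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 4, 4, 8))

pooled = average_pool_video_features(features, target_shape=(2, 2, 2))
# Flatten the pooled grid into a token sequence for the text decoder.
tokens = pooled.reshape(-1, pooled.shape[-1])
print(features.shape, "->", pooled.shape, "->", tokens.shape)
```

In this toy setup the 64 visual tokens (4 × 4 × 4) shrink to 8, and no learnable weights are involved, which is what "parameter-free" refers to.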
The pooling operation surprisingly reduces the loss of spatial and temporal information. See below for some examples of how it captures details 🤗
data:image/s3,"s3://crabby-images/46f32/46f32354347f5b96737e589df599fb87f43ad554" alt="image_3"
According to the authors' findings, it performs far better than many existing models (including proprietary VLMs) and scales very well (with the text decoder).
data:image/s3,"s3://crabby-images/bd2ab/bd2ab1d0de6f4072fbeac31825d6e514147a5e2a" alt="image_4"
Model repositories 🤗 [7B](https://t.co/AeSdYsz1U7), [13B](https://t.co/GnI1niTxO7), [34B](https://t.co/HWAM0ZzvDc)
Spaces🤗 [7B](https://t.co/Oms2OLkf7O), [13B](https://t.co/C2RNVNA4uR)
> [!TIP]
Resources:
[PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994)
by Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng (2024)
[GitHub](https://github.com/magic-research/PLLaVA)
> [!NOTE]
[Original tweet](https://twitter.com/mervenoyann/status/1786336055425138939) (May 3, 2024)