BIMBA: Selective-Scan Compression for Long-Range Video Question Answering Paper • 2503.09590 • Published 1 day ago • 2
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos Paper • 2409.07450 • Published Sep 11, 2024 • 11
Video ReCap: Recursive Captioning of Hour-Long Videos Paper • 2402.13250 • Published Feb 20, 2024 • 26
Is Space-Time Attention All You Need for Video Understanding? Paper • 2102.05095 • Published Feb 9, 2021 • 1
SimpleClick: Interactive Image Segmentation with Simple Vision Transformers Paper • 2210.11006 • Published Oct 20, 2022
Unified Coarse-to-Fine Alignment for Video-Text Retrieval Paper • 2309.10091 • Published Sep 18, 2023
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs Paper • 2101.12059 • Published Jan 28, 2021
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences Paper • 2401.10529 • Published Jan 19, 2024 • 1