---
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
---

# Vamba

This repo contains model checkpoints for **Vamba-Qwen2-VL-7B**. Vamba is a hybrid Mamba-Transformer model that leverages cross-attention layers and Mamba-2 blocks for efficient hour-long video understanding.

[**🌐 Homepage**](https://tiger-ai-lab.github.io/Vamba/) | [**πŸ“– arXiv**](https://arxiv.org/abs/2503.11579) | [**πŸ’» GitHub**](https://github.com/TIGER-AI-Lab/Vamba) | [**πŸ€— Model**](https://huggingface.co/TIGER-Lab/Vamba-Qwen2-VL-7B)

## Vamba Model Architecture
<p align="center">
<img src="https://tiger-ai-lab.github.io/Vamba/static/images/vamba_main.png" width="900">
</p>
The main computation overhead in transformer-based LMMs comes from the quadratic complexity of self-attention over the video tokens. To overcome this issue, we design a hybrid Mamba-Transformer architecture that processes text and video tokens differently. The key idea is to split the expensive self-attention over the entire video-and-text token sequence into two more efficient components. Since video tokens typically dominate the sequence while text tokens remain few, we keep self-attention exclusively for the text tokens and eliminate it for the video tokens. Instead, we add cross-attention layers that use text tokens as queries and video tokens as keys and values, and we employ Mamba-2 blocks to efficiently process the video tokens.
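
For intuition, here is a minimal PyTorch sketch of one such hybrid block. It only illustrates the idea described above and is not the actual Vamba implementation: the class name, layer layout, and the injected `video_ssm` module (a stand-in for a real Mamba-2 block) are assumptions; see the GitHub repository for the real code.

```python
import torch
import torch.nn as nn


class HybridBlockSketch(nn.Module):
    """Illustrative hybrid block: self-attention over text tokens only,
    cross-attention from text (queries) to video (keys/values), and a
    linear-time SSM (e.g. a Mamba-2 block) over the video tokens."""

    def __init__(self, d_model: int, n_heads: int, video_ssm: nn.Module):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_ssm = video_ssm  # placeholder for a real Mamba-2 module
        self.norm_sa = nn.LayerNorm(d_model)
        self.norm_ca_q = nn.LayerNorm(d_model)
        self.norm_ca_kv = nn.LayerNorm(d_model)
        self.norm_ssm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, T_text, D), video: (B, T_video, D); T_video >> T_text in practice.

        # 1) Self-attention only over the short text sequence (quadratic in T_text).
        t = self.norm_sa(text)
        text = text + self.self_attn(t, t, t)[0]

        # 2) Cross-attention: text queries, video keys/values (linear in T_video).
        q, kv = self.norm_ca_q(text), self.norm_ca_kv(video)
        text = text + self.cross_attn(q, kv, kv)[0]

        # 3) Video tokens are updated by a linear-time SSM instead of self-attention.
        video = video + self.video_ssm(self.norm_ssm(video))
        return text, video


# Example wiring with a trivial stand-in for the Mamba-2 block:
block = HybridBlockSketch(d_model=256, n_heads=8, video_ssm=nn.Linear(256, 256))
text = torch.randn(1, 32, 256)     # few text tokens
video = torch.randn(1, 4096, 256)  # many video tokens
text_out, video_out = block(text, video)
```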


## Quick Start
```python
# git clone https://github.com/TIGER-AI-Lab/Vamba
# cd Vamba
# export PYTHONPATH=.
from tools.vamba_chat import Vamba

model = Vamba(model_path="TIGER-Lab/Vamba-Qwen2-VL-7B", device="cuda")

# Video + text query; the metadata controls frame sampling and image resizing.
test_input = [
    {
        "type": "video",
        "content": "assets/magic.mp4",
        "metadata": {
            "video_num_frames": 128,
            "video_sample_type": "middle",
            "img_longest_edge": 640,
            "img_shortest_edge": 256,
        }
    },
    {
        "type": "text",
        "content": "<video> Describe the magic trick."
    }
]
print(model(test_input))

# Image + text query; metadata can be left empty to use the defaults.
test_input = [
    {
        "type": "image",
        "content": "assets/old_man.png",
        "metadata": {}
    },
    {
        "type": "text",
        "content": "<image> Describe this image."
    }
]
print(model(test_input))
```

## Citation
If you find our paper useful, please cite us with:
```
@misc{ren2025vambaunderstandinghourlongvideos,
      title={Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers}, 
      author={Weiming Ren and Wentao Ma and Huan Yang and Cong Wei and Ge Zhang and Wenhu Chen},
      year={2025},
      eprint={2503.11579},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.11579}, 
}
```