wren93 committed · Commit 938b37e · verified · Parent: 189073a

Update README.md

Files changed (1): README.md (+38 -0)
README.md CHANGED
@@ -17,7 +17,45 @@ This repo contains model checkpoints for **Vamba-Qwen2-VL-7B**. Vamba is a hybri
 
The main computational overhead in transformer-based LMMs comes from the quadratic cost of self-attention over the video tokens. To overcome this issue, we design a hybrid Mamba-Transformer architecture that processes text and video tokens differently. The key idea is to split the expensive self-attention over the entire video-and-text token sequence into two more efficient components. Since video tokens typically dominate the sequence while text tokens remain few, we keep self-attention exclusively for the text tokens and eliminate it for the video tokens. Instead, we add cross-attention layers that use text tokens as queries and video tokens as keys and values, and we employ Mamba blocks to process the video tokens efficiently.
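To make the token routing concrete, below is a minimal PyTorch sketch of one such hybrid block under stated assumptions: the class name `HybridBlock`, the dimensions, and the `nn.Linear` stand-in for a real Mamba/SSM module are illustrative only and are not the released Vamba implementation. Text tokens self-attend (cheap, since there are few of them) and then cross-attend to the video tokens, while the video tokens bypass self-attention entirely.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Illustrative hybrid block: self-attention over text only, cross-attention
    from text to video, and a linear-time update (Mamba stand-in) for video."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for a Mamba (selective SSM) block that scans video tokens in linear time.
        self.video_mixer = nn.Linear(d_model, d_model)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, T, d) with small T; video: (B, V, d) with large V.
        t = text + self.text_self_attn(text, text, text)[0]   # O(T^2): cheap, T is small
        t = t + self.cross_attn(t, video, video)[0]           # O(T*V): linear in video length
        v = video + self.video_mixer(video)                   # no O(V^2) self-attention over video
        return t, v

text = torch.randn(1, 32, 1024)       # short text prompt
video = torch.randn(1, 16384, 1024)   # long video token sequence
t_out, v_out = HybridBlock()(text, video)
print(t_out.shape, v_out.shape)       # torch.Size([1, 32, 1024]) torch.Size([1, 16384, 1024])
```

In the actual model the stand-in would be a real selective state-space block; the usual norms and MLPs around each sub-layer are omitted here for brevity.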
# Quick Start
```python
# git clone https://github.com/TIGER-AI-Lab/Vamba
# cd Vamba
# export PYTHONPATH=.

from tools.vamba_chat import Vamba

model = Vamba(model_path="TIGER-Lab/Vamba-Qwen2-VL-7B", device="cuda")

# Video example: frame sampling and resolution are controlled via the metadata fields.
test_input = [
    {
        "type": "video",
        "content": "assets/magic.mp4",
        "metadata": {
            "video_num_frames": 128,
            "video_sample_type": "middle",
            "img_longest_edge": 640,
            "img_shortest_edge": 256,
        }
    },
    {
        "type": "text",
        "content": "<video> Describe the magic trick."
    }
]
print(model(test_input))

# Image example.
test_input = [
    {
        "type": "image",
        "content": "assets/old_man.png",
        "metadata": {}
    },
    {
        "type": "text",
        "content": "<image> Describe this image."
    }
]
print(model(test_input))
```

## Citation
If you find our paper useful, please cite us with