---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- allenai/pixmo-docs
- HuggingFaceM4/Docmatix
- lmms-lab/LLaVA-Video-178K
- ShareGPT4Video/ShareGPT4Video
language:
- en
metrics:
- accuracy
pipeline_tag: visual-question-answering
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- DAMO-NLP-SG/VideoLLaMA3-2B-Image
---


<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/tt5KYnAUmQlHtfB1-Zisl.png" width="150" style="margin-bottom: 0.2;"/>
</p>


<h3 align="center"><a href="https://arxiv.org/abs/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3>


<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">GitHub</a> for the latest updates.  </h5>


## 📰 News
<!-- * **[2025.01.23]**  👋👋 Update technical report. If you have works closely related to VideoLLaMA3 but not mentioned in the paper, feel free to let us know. -->
* **[2025.01.24]**  🔥🔥 Online Demo is available: [VideoLLaMA3-Image-7B](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3-Image), [VideoLLaMA3-7B](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3).
* **[2025.01.22]**  Released the models and inference code of VideoLLaMA 3.

## 🌟 Introduction
VideoLLaMA 3 represents a state-of-the-art series of multimodal foundation models designed to excel in both image and video understanding tasks. Leveraging advanced architectures, VideoLLaMA 3 demonstrates exceptional capabilities in processing and interpreting visual content across various contexts. These models are specifically designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.





## 🌎 Model Zoo
| Model                | Base Model   | HF Link                                                      |
| -------------------- | ------------ | ------------------------------------------------------------ |
| VideoLLaMA3-7B       | Qwen2.5-7B   | [DAMO-NLP-SG/VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B) |
| VideoLLaMA3-2B (**This Checkpoint**)       | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B) |
| VideoLLaMA3-7B-Image | Qwen2.5-7B   | [DAMO-NLP-SG/VideoLLaMA3-7B-Image](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B-Image) |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B-Image](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B-Image) |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for broader use:

| Model                         | Base Model                | HF Link                                                      |
| ----------------------------- | ------------------------- | ------------------------------------------------------------ |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | [DAMO-NLP-SG/VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) |
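
If you only need the visual backbone, the encoder can in principle be loaded on its own. The snippet below is a minimal sketch rather than an official recipe: it assumes the repository ships custom model and image-processor code loadable via `trust_remote_code`, and the exact preprocessing arguments and output format may differ from what is shown.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

# Sketch only: assumes the encoder is exposed through AutoModel/AutoImageProcessor
# with trust_remote_code; check the GitHub repository for the exact interface.
encoder_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
encoder = AutoModel.from_pretrained(
    encoder_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
image_processor = AutoImageProcessor.from_pretrained(encoder_name, trust_remote_code=True)

image = Image.open("put your image path here").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt")
inputs = {k: v.to(encoder.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    features = encoder(**inputs)  # patch-level visual features
```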



## 🚀 Main Results


<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/609115c79a8bcaa437b234a9/KDNfFdoKwzPDzX6OwJAi2.png">

* \* denotes the reproduced results.

## 🤖 Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "DAMO-NLP-SG/VideoLLaMA3-2B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
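    # flash_attention_2 requires the flash-attn package and a compatible GPU;
    # if it is not available, this argument can be removed or set to "sdpa".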
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
video_path = "put your video path here"
question = "Describe this video in detail."

# Video conversation
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "data": {"video_path": video_path, "fps": 1, "max_frames": 128}},
            {"type": "text", "data": question},
        ]
    },
]

inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```
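
The same processor interface is also meant to handle single images. The variant below is a lightly adapted sketch of the video example above, reusing the already-loaded `model` and `processor`; it assumes the processor accepts an `"image"` content entry with an `image_path` field, so check the GitHub repository for the exact conversation schema.

```python
# Image conversation (sketch): reuses `model` and `processor` from the snippet above.
# The "image" content entry with an "image_path" field is an assumption; verify it
# against the official examples in the GitHub repository.
image_path = "put your image path here"
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "data": {"image_path": image_path}},
            {"type": "text", "data": "Describe this image in detail."},
        ]
    },
]

inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```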


## Citation

If you find VideoLLaMA useful for your research and applications, please cite it using this BibTeX:
```bibtex
@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}
```