Improve model card with Updates, Model Zoo, and Training information
#3
by nielsr (HF Staff) - opened
README.md
CHANGED
@@ -5,9 +5,9 @@ library_name: transformers
 license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
-pipeline_tag: video-text-to-text
 model-index:
 - name: InternVL2.5_HiCo_R64
   results:
@@ -61,23 +61,31 @@ model-index:
       value: 66.4
       name: accuracy
       verified: true
-
 ---

-# 📕
+# 📕InternVideo2.5_HiCo_R64⚡
 <!-- [\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
 [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
 [\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->

-
+InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves on existing MLLMs by enhancing their ability to perceive fine-grained details and to capture long-form temporal structure. We achieve this through dense vision task annotations with task preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is an ablation variant of InternVideo2.5 built with HiCo only (**R64 means 64 tokens per frame**).
+
+## 🔥 Updates
+- `2025/06/11`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) achieves 53.2% accuracy on [VideoEval-Pro (MCQ)](https://huggingface.co/spaces/TIGER-Lab/VideoEval-Pro) (thanks to the authors for the benchmark), which positions it as one of the top-performing open-source MLLMs at the 7-8B parameter scale.
+- `2025/01/23`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) and [InternVL2.5-HiCo](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) have been officially released on Hugging Face.
+- `2025/01/22`: The [technical report](https://arxiv.org/pdf/2501.12386) of InternVideo2.5 is released.

+## 🌟 Model Zoo
+| MLLM | Link | MVBench | Perception Test | LongVideoBench | MLVU | VideoMME | LVBench | #Tokens per frame | #Params |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| InternVideo2.5 | [huggingface](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) | 75.7 | 74.9 | 60.6 | 72.8 | 65.1 | 46.4 | 16 | 8B |
+| InternVL2.5 + HiCo (R16) | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) | 74.0 | 71.4 | 59.6 | 71.5 | 64.9 | - | 16 | 8B |
+| InternVL2.5 + HiCo (R64) | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R64) | 74.4 | 71.9 | 62.7 | 72.6 | 66.4 | - | 64 | 8B |

+## ⚙️ Training

-
-| Model | MVBench | LongVideoBench | VideoMME(w/o sub)|
-| --- | --- | --- | --- |
-|InternVL2.5_HiCo_R64| 74.4 | 62.7 | 66.4|
+See [Finetuning Code](https://github.com/OpenGVLab/VideoChat-Flash/tree/main/xtuner-train_internvideo2_5).

 ## 🚀 How to use the model

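The usage code touched by the final hunk below assumes the checkpoint is already loaded through the standard `transformers` remote-code path. A minimal loading sketch, illustrative only and not part of the diff (the card's full example may pass additional arguments):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repo id taken from the Model Zoo table above (the R64 variant).
model_id = "OpenGVLab/InternVL_2_5_HiCo_R64"

# trust_remote_code loads the repo's custom modeling code, which provides the
# chat() interface used in the card's "How to use the model" section.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 cast in the usage snippet
    trust_remote_code=True,
).cuda().eval()
```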
@@ -233,7 +241,8 @@ with torch.no_grad():

     pixel_values, num_patches_list = load_video(video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False)
     pixel_values = pixel_values.to(torch.bfloat16).to(model.device)
-    video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])
+    video_prefix = "".join([f"Frame{i+1}: <image>
+" for i in range(len(num_patches_list))])
     # single-turn conversation
     question1 = "Describe this video in detail."
     question = video_prefix + question1
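When copying this step out of the card, keep the frame separator as an escaped `\n` inside the f-string: a raw line break inside a single-quoted string does not parse. A runnable sketch of this step, assuming the definitions (`model`, `tokenizer`, `generation_config`, `load_video`, `video_path`, `num_segments`) and the InternVL-style `chat()` helper from the card's full usage example:

```python
import torch

# Sketch only: model, tokenizer, generation_config, load_video, video_path and
# num_segments are assumed to be defined as in the card's full example.
with torch.no_grad():
    pixel_values, num_patches_list = load_video(
        video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False
    )
    pixel_values = pixel_values.to(torch.bfloat16).to(model.device)

    # One "<image>" placeholder per sampled frame, separated by escaped newlines.
    video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])

    # single-turn conversation
    question = video_prefix + "Describe this video in detail."
    output, chat_history = model.chat(
        tokenizer, pixel_values, question, generation_config,
        num_patches_list=num_patches_list, history=None, return_history=True
    )
    print(output)
```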