Improve model card with Updates, Model Zoo, and Training information

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +18 -9
README.md CHANGED
```diff
@@ -5,9 +5,9 @@ library_name: transformers
 license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
-pipeline_tag: video-text-to-text
 model-index:
 - name: InternVL2.5_HiCo_R64
   results:
@@ -61,23 +61,31 @@ model-index:
       value: 66.4
       name: accuracy
       verified: true
-
 ---
 
-# πŸ“•InternVL2.5_HiCo_R64⚑
+# πŸ“•InternVideo2.5_HiCo_R64⚑
 <!-- [\[πŸ“° Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
 [\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
 [\[πŸ“œ Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[πŸ—¨οΈ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->
 
 InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and to capture long-form temporal structure. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is an ablation variant of InternVideo2.5, built with HiCo only (**R64 means 64 tokens per frame**).
 
-## πŸ“ˆ Performance
-| Model | MVBench | LongVideoBench | VideoMME(w/o sub)|
-| --- | --- | --- | --- |
-|InternVL2.5_HiCo_R64| 74.4 | 62.7 | 66.4|
+## πŸš€ Updates
+- `2025/06/11`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) achieves an accuracy of 53.2% on [VideoEval-Pro (MCQ)](https://huggingface.co/spaces/TIGER-Lab/VideoEval-Pro) (thanks to the authors for the benchmark). This result positions InternVideo2.5 as one of the top-performing open-source MLLMs at the 7-8B parameter scale.
+- `2025/01/23`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) and [InternVL2.5-HiCo](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) have been officially released on Hugging Face.
+- `2025/01/22`: The [technical report](https://arxiv.org/pdf/2501.12386) of InternVideo2.5 is released.
+
+## πŸ“ˆ Model Zoo
+| MLLM | Link | MVBench | Perception Test | LongVideoBench | MLVU | VideoMME | LVBench | #Tokens per frame | #Params |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| InternVideo2.5 | [huggingface](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) | 75.7 | 74.9 | 60.6 | 72.8 | 65.1 | 46.4 | 16 | 8B |
+| InternVL2.5 + HiCo | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) | 74.0 | 71.4 | 59.6 | 71.5 | 64.9 | - | 16 | 8B |
+| InternVL2.5 + HiCo | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R64) | 74.4 | 71.9 | 62.7 | 72.6 | 66.4 | - | 64 | 8B |
+
+## βš™οΈ Training
+
+See [Finetuning Code](https://github.com/OpenGVLab/VideoChat-Flash/tree/main/xtuner-train_internvideo2_5).
 
 ## πŸš€ How to use the model
 
@@ -233,7 +241,8 @@ with torch.no_grad():
 
 pixel_values, num_patches_list = load_video(video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False)
 pixel_values = pixel_values.to(torch.bfloat16).to(model.device)
-video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])
+video_prefix = "".join([f"Frame{i+1}: <image>
+" for i in range(len(num_patches_list))])
 # single-turn conversation
 question1 = "Describe this video in detail."
 question = video_prefix + question1
```
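
For reference, the frame-prefix line touched by the last hunk is only valid Python when the newline stays as an escaped `\n` inside a single-line f-string. The sketch below shows the single-turn step around it in runnable form; it assumes `model`, `tokenizer`, `generation_config`, `load_video`, `video_path`, and `num_segments` are set up as in the card's full "How to use the model" section, and the InternVL-style `model.chat` signature is an assumption rather than something shown in this diff.

```python
import torch

# Assumed to exist from the card's full usage section (not shown in this diff):
#   model, tokenizer, generation_config, load_video, video_path, num_segments

with torch.no_grad():
    pixel_values, num_patches_list = load_video(
        video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False
    )
    pixel_values = pixel_values.to(torch.bfloat16).to(model.device)

    # One "<image>" placeholder per sampled frame; keep the newline as an
    # escaped "\n" so the f-string stays on a single line of valid Python.
    video_prefix = "".join(f"Frame{i + 1}: <image>\n" for i in range(len(num_patches_list)))

    # Single-turn conversation
    question = video_prefix + "Describe this video in detail."
    # The chat signature below follows the InternVL-style interface and is an assumption here.
    output, chat_history = model.chat(
        tokenizer, pixel_values, question, generation_config,
        num_patches_list=num_patches_list, history=None, return_history=True,
    )
    print(output)
```

Multi-turn use would pass the returned `chat_history` back as `history=chat_history` on the next call, assuming the same interface.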
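
For a rough sense of what the `#Tokens per frame` column and the R16/R64 naming mean for the visual-token budget, the back-of-the-envelope sketch below multiplies tokens per frame by a frame count; the 128-frame figure is purely an illustrative assumption (the actual `num_segments` value is set in the card's usage code, not in this diff).

```python
# Back-of-the-envelope visual-token budget under HiCo compression.
# Per-frame token counts come from the Model Zoo table; the frame count is illustrative.
num_frames = 128                  # hypothetical number of sampled frames (num_segments)
tokens_r16 = 16 * num_frames      # InternVL2.5 + HiCo R16 -> 2048 visual tokens
tokens_r64 = 64 * num_frames      # InternVL2.5 + HiCo R64 -> 8192 visual tokens
print(f"R16: {tokens_r16} tokens, R64: {tokens_r64} tokens for {num_frames} frames")
```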