Improve model card with Updates, Model Zoo, and Training information

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +18 -9
README.md CHANGED
```diff
@@ -5,9 +5,9 @@ library_name: transformers
 license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
-pipeline_tag: video-text-to-text
 model-index:
 - name: InternVL2.5_HiCo_R64
   results:
@@ -61,23 +61,31 @@ model-index:
       value: 66.4
       name: accuracy
       verified: true
-
 ---
 
-# πŸ“•InternVL2.5_HiCo_R64⚑
+# πŸ“•InternVideo2.5_HiCo_R64⚑
 <!-- [\[πŸ“° Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
 [\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
 [\[πŸ“œ Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[πŸ—¨οΈ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->
 
 InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and to capture long-form temporal structure. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is an ablation variant of InternVideo2.5, built with HiCo only (**R64 means 64 tokens per frame**).
 
-## πŸ“ˆ Performance
-| Model | MVBench | LongVideoBench | VideoMME(w/o sub)|
-| --- | --- | --- | --- |
-|InternVL2.5_HiCo_R64| 74.4 | 62.7 | 66.4|
+## πŸš€ Updates
+- `2025/06/11`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) achieves an accuracy of 53.2% on [VideoEval-Pro (MCQ)](https://huggingface.co/spaces/TIGER-Lab/VideoEval-Pro) (thanks to the authors for the benchmark). This result positions InternVideo2.5 as one of the top-performing open-source MLLMs at the 7-8B parameter scale.
+- `2025/01/23`: [InternVideo2.5 (InternVL2.5 + LRC)](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) and [InternVL2.5-HiCo](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) have been officially released on Hugging Face.
+- `2025/01/22`: The [technical report](https://arxiv.org/pdf/2501.12386) of InternVideo2.5 is released.
+
+## πŸ“ˆ Model Zoo
+| MLLM | Link | MVBench | Perception Test | LongVideoBench | MLVU | VideoMME | LVBench | #Tokens per frame | #Params |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| InternVideo2.5 | [huggingface](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) | 75.7 | 74.9 | 60.6 | 72.8 | 65.1 | 46.4 | 16 | 8B |
+| InternVL2.5 + HiCo | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R16) | 74.0 | 71.4 | 59.6 | 71.5 | 64.9 | - | 16 | 8B |
+| InternVL2.5 + HiCo | [huggingface](https://huggingface.co/OpenGVLab/InternVL_2_5_HiCo_R64) | 74.4 | 71.9 | 62.7 | 72.6 | 66.4 | - | 64 | 8B |
+
+## βš™οΈ Training
+
+See [Finetuning Code](https://github.com/OpenGVLab/VideoChat-Flash/tree/main/xtuner-train_internvideo2_5).
 
 ## πŸš€ How to use the model
 
@@ -233,7 +241,8 @@ with torch.no_grad():
 
 pixel_values, num_patches_list = load_video(video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False)
 pixel_values = pixel_values.to(torch.bfloat16).to(model.device)
-video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])
+video_prefix = "".join([f"Frame{i+1}: <image>
+" for i in range(len(num_patches_list))])
 # single-turn conversation
 question1 = "Describe this video in detail."
 question = video_prefix + question1
```
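
For reference, the frame-prefix line touched by the last hunk is only valid Python when the newline stays as an escaped `\n` inside a single-line f-string. The sketch below shows the single-turn step around it in runnable form; it assumes `model`, `tokenizer`, `generation_config`, `load_video`, `video_path`, and `num_segments` are set up as in the card's full "How to use the model" section, and the InternVL-style `model.chat` signature is an assumption rather than something shown in this diff.

```python
import torch

# Assumed to exist from the card's full usage section (not shown in this diff):
#   model, tokenizer, generation_config, load_video, video_path, num_segments

with torch.no_grad():
    pixel_values, num_patches_list = load_video(
        video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False
    )
    pixel_values = pixel_values.to(torch.bfloat16).to(model.device)

    # One "<image>" placeholder per sampled frame; keep the newline as an
    # escaped "\n" so the f-string stays on a single line of valid Python.
    video_prefix = "".join(f"Frame{i + 1}: <image>\n" for i in range(len(num_patches_list)))

    # Single-turn conversation
    question = video_prefix + "Describe this video in detail."
    # The chat signature below follows the InternVL-style interface and is an assumption here.
    output, chat_history = model.chat(
        tokenizer, pixel_values, question, generation_config,
        num_patches_list=num_patches_list, history=None, return_history=True,
    )
    print(output)
```

Multi-turn use would pass the returned `chat_history` back as `history=chat_history` on the next call, assuming the same interface.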
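
For a rough sense of what the `#Tokens per frame` column and the R16/R64 naming mean for the visual-token budget, the back-of-the-envelope sketch below multiplies tokens per frame by a frame count; the 128-frame figure is purely an illustrative assumption (the actual `num_segments` value is set in the card's usage code, not in this diff).

```python
# Back-of-the-envelope visual-token budget under HiCo compression.
# Per-frame token counts come from the Model Zoo table; the frame count is illustrative.
num_frames = 128                  # hypothetical number of sampled frames (num_segments)
tokens_r16 = 16 * num_frames      # InternVL2.5 + HiCo R16 -> 2048 visual tokens
tokens_r64 = 64 * num_frames      # InternVL2.5 + HiCo R64 -> 8192 visual tokens
print(f"R16: {tokens_r16} tokens, R64: {tokens_r64} tokens for {num_frames} frames")
```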