nielsr (HF Staff) committed · verified
Commit fcfe416 · 1 Parent(s): 2575a33

Improve InternVL3_5-4B Model Card


This PR improves the model card for InternVL3_5-4B by:

- Adding a prominent link to the paper "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency".
- Organizing the "Quick Start" and "Deployment" sections for better clarity, separating `transformers` usage from deployment solutions like `LMDeploy`.
- Adding a new "Multimodal Datasets" section to highlight relevant datasets used by the model.
- Adding a helpful tip for finetuning these models.

The metadata (`pipeline_tag`, `library_name`, `license`) remains unchanged as it is already correct and comprehensive.

Files changed (1)
  1. README.md +47 -30
README.md CHANGED
@@ -1,22 +1,24 @@
1
  ---
2
- license: apache-2.0
3
- pipeline_tag: image-text-to-text
4
- library_name: transformers
5
  base_model:
6
- - OpenGVLab/InternVL3_5-4B-MPO
7
- base_model_relation: finetune
8
  datasets:
9
- - OpenGVLab/MMPR-v1.2
10
- - OpenGVLab/MMPR-Tiny
11
  language:
12
- - multilingual
 
 
 
13
  tags:
14
- - internvl
15
- - custom_code
 
16
  ---
17
 
18
  # InternVL3_5-4B
19
 
 
 
20
  [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)
21
 
22
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
@@ -27,13 +29,13 @@ tags:
27
 
28
  ## Introduction
29
 
30
- We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
31
 
32
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)
33
 
34
  > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.
35
 
36
- See [quick start](#quick-start) for how to use our model.
37
 
38
  ## InternVL3.5 Family
39
 
@@ -137,11 +139,11 @@ The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained
137
 
138
 
139
  `InternVL3.5-Flash`:
140
- Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router (ViR)*, thus yielding a series of efficient variants suitable for resource-constrained scenarios.
141
  Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
142
  In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
143
  For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
144
- Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.
145
 
146
 
147
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)
@@ -156,8 +158,8 @@ $$
156
  \mathcal{L}_{i}=-\log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right),
157
  $$
158
 
159
- where \\(x_i\\) is the predicted token and prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss.
160
- Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:
161
 
162
  $$
163
  \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}},
@@ -167,20 +169,20 @@ where \\(N\\) denotes the number of tokens in the training sample on which the l
167
 
168
  ### Supervised Fine-Tuning
169
 
170
- During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
171
- Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources:
172
 
173
- (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.
174
 
175
- (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.
176
 
177
  (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.
178
 
179
  ### Cascade Reinforcement Learning
180
 
181
  Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner.
182
- Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage.
183
- Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our Cascade RL achieves significant performance improvements at a fraction of the GPU time cost.
184
 
185
 
186
 
@@ -233,7 +235,7 @@ $$
233
  \Bigg],
234
  $$
235
 
236
- where \\(\mathrm{KL}\\) denotes the KL divergence and \\(\xi\\) denotes the compression rate, which is uniformly sampled from \\(\{\frac{1}{4},\frac{1}{16}\}\\). The image \\(I_\xi\\) is represented as 256 tokens when \\(\xi=\frac{1}{4}\\) and 64 tokens when \\(\xi=\frac{1}{16}\\). Notably, the reference model always performs inference with \\(\xi=\frac{1}{4}\\).
237
 
238
 
239
  `Router training`:
@@ -257,7 +259,7 @@ y_i^\text{router} =
257
  \end{cases}
258
  $$
259
 
260
- where \\(y_i^{\text{router}}=0\\) and \\(y_i^{\text{router}}=1\\) indicate that the compression rate \\(\xi\\) is set to \\(\tfrac{1}{16}\\) and \\(\tfrac{1}{4}\\), respectively.
261
 
262
  > Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.
263
 
@@ -278,7 +280,7 @@ This approach improves reasoning breadth.
278
 
279
  ### Decoupled Vision-Language Deployment
280
 
281
- In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states. In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency.
282
  When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.
283
 
284
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/DvD.jpg)
@@ -351,11 +353,20 @@ DvD increases GPU utilization and processing efficiency on the vision side, whil
351
 
352
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_dvd.jpg)
353
 
354
- ## Quick Start
355
 
356
- We provide example code to run `InternVL3.5-8B-HF` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.
 
 
 
 
 
 
 
357
 
358
- > In most cases, both [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm) can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.
 
 
359
 
360
  > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.
361
 
@@ -446,6 +457,10 @@ The HuggingFace format checkpoints of our models are fully consistent with the A
446
 
447
  Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
448
 
 
 
 
 
449
  ## Deployment
450
 
451
  ### LMDeploy
@@ -494,7 +509,9 @@ image_urls=[
494
 
495
  images = [load_image(img_url) for img_url in image_urls]
496
  # Numbering images improves multi-image conversations
497
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 
 
498
  print(response.text)
499
  ```
500
 
@@ -596,4 +613,4 @@ If you find this project useful in your research, please consider citing:
596
  journal={arXiv preprint arXiv:2508.18265},
597
  year={2025}
598
  }
599
- ```
 
1
  ---
 
 
 
2
  base_model:
3
+ - OpenGVLab/InternVL3_5-4B-MPO
 
4
  datasets:
5
+ - OpenGVLab/MMPR-v1.2
6
+ - OpenGVLab/MMPR-Tiny
7
  language:
8
+ - multilingual
9
+ library_name: transformers
10
+ license: apache-2.0
11
+ pipeline_tag: image-text-to-text
12
  tags:
13
+ - internvl
14
+ - custom_code
15
+ base_model_relation: finetune
16
  ---
17
 
18
  # InternVL3_5-4B
19
 
20
+ This repository contains the InternVL3_5-4B model presented in the paper [InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://huggingface.co/papers/2508.18265).
21
+
22
  [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)
23
 
24
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
29
 
30
  ## Introduction
31
 
32
+ We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL) framework*, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
33
 
34
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)
35
 
36
  > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.
37
 
38
+ See [quick start](#quick-start-with-huggingface) for how to use our model.
39
 
40
  ## InternVL3.5 Family
41
 
 
139
 
140
 
141
  `InternVL3.5-Flash`:
142
+ Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router (ViR)*, thus yielding a series of efficient variants suitable for resource-constrained scenarios.
143
  Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
144
  In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
145
  For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
146
+ Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.
147
 
148
 
149
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)
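
To make the routing and compression concrete, the sketch below shows pixel-shuffle-style token merging together with a toy binary patch router. The `PatchRouterSketch` module, its mean-pooled linear scorer, and all shapes other than the 1024 → 256/64 token counts are illustrative assumptions rather than the released ViR implementation:

```python
import torch
import torch.nn as nn

def pixel_shuffle_tokens(x: torch.Tensor, scale: int) -> torch.Tensor:
    """Merge each `scale x scale` group of visual tokens into one token.

    x: (B, N, C) with N a perfect square (e.g. 1024 = 32 x 32).
    Returns: (B, N // scale**2, C * scale**2).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.view(b, h // scale, scale, w // scale, scale, c)      # split the token grid
    x = x.permute(0, 1, 3, 2, 4, 5)                             # group spatial neighbours together
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)

class PatchRouterSketch(nn.Module):
    """Hypothetical binary router: pick 1/4 or 1/16 compression per image patch."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 2)   # class 0 -> 1/16 rate (64 tokens), class 1 -> 1/4 rate (256 tokens)

    def forward(self, patch_tokens: torch.Tensor) -> list:
        # patch_tokens: (B, 1024, C) vision-encoder tokens for B image patches
        decision = self.score(patch_tokens.mean(dim=1)).argmax(dim=-1)   # (B,) routing decision
        outputs = []
        for i, keep_high_res in enumerate(decision.tolist()):
            scale = 2 if keep_high_res == 1 else 4                       # 1/4 or 1/16 of 1024 tokens
            outputs.append(pixel_shuffle_tokens(patch_tokens[i : i + 1], scale))
        return outputs                                                   # 256- or 64-token sequences

tokens = torch.randn(2, 1024, 64)
compressed = PatchRouterSketch(dim=64)(tokens)
print([t.shape for t in compressed])   # 256 or 64 tokens per patch, depending on the router's decision
```

In the full model, the differently compressed token sequences would still be projected to a common LLM embedding width; the sketch stops at the token-count reduction itself.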
 
158
  \mathcal{L}_{i}=-\log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right),
159
  $$
160
 
161
+ where \\(x_i\\) is the predicted token and prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss.
162
+ Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:
163
 
164
  $$
165
  \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}},
 
169
 
170
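
As a concrete reference for the formula above, here is a minimal sketch of the square-root re-weighting, assuming `(batch, seq, vocab)` logits and labels in which non-response tokens are masked with `-100`; this is not the actual training code:

```python
import torch
import torch.nn.functional as F

def reweighted_ntp_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """NTP loss where each sample is weighted by 1 / sqrt(N), N = its number of response tokens."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=ignore_index, reduction="none"
    )                                                   # (B, T), zeros on ignored positions
    mask = (labels != ignore_index).float()
    n = mask.sum(dim=1, keepdim=True).clamp(min=1.0)    # N per sample
    w = mask / n.sqrt()                                 # w_i = 1 / N^{0.5} for every loss-bearing token
    return (w * per_token).sum() / w.sum()              # L'_i = (w_i / sum_j w_j) * L_i, summed
```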
  ### Supervised Fine-Tuning
171
 
172
+ During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
173
+ Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources:
174
 
175
+ (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.
176
 
177
+ (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.
178
 
179
  (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.
180
 
181
  ### Cascade Reinforcement Learning
182
 
183
  Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner.
184
+ Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage.
185
+ Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our Cascade RL achieves significant performance improvements at a fraction of the GPU time cost.
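
The cascade can be pictured with a deliberately tiny toy in which a 4-armed bandit stands in for response generation; the DPO-style offline objective, the REINFORCE-style online update, and every hyperparameter below are illustrative assumptions, not the recipe used for InternVL3.5:

```python
import torch
import torch.nn.functional as F

# A 4-armed bandit stands in for "generate a response"; arm 2 is the best answer.
logits = torch.zeros(4, requires_grad=True)          # the "policy"
ref_logits = logits.detach().clone()                 # frozen reference policy
optimizer = torch.optim.Adam([logits], lr=0.1)
reward = torch.tensor([0.1, 0.2, 0.9, 0.3])          # oracle reward per action

# Stage 1: offline RL warm-up on fixed preference pairs (chosen, rejected).
pairs = [(2, 0), (2, 1), (3, 0)]
for _ in range(200):
    log_p = F.log_softmax(logits, dim=-1)
    log_ref = F.log_softmax(ref_logits, dim=-1)
    loss = sum(
        -F.logsigmoid((log_p[c] - log_ref[c]) - (log_p[r] - log_ref[r]))
        for c, r in pairs
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: online RL on rollouts sampled from the (now warmed-up) policy itself.
for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((8,))                       # self-generated rollouts
    advantage = reward[actions] - reward[actions].mean()
    loss = -(dist.log_prob(actions) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=-1))  # probability mass should concentrate on arm 2
```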
186
 
187
 
188
 
 
235
  \Bigg],
236
  $$
237
 
238
+ where \\(\mathrm{KL}\\) denotes the KL divergence and \\(\xi\\) denotes the compression rate, which is uniformly sampled from \\(\{\frac{1}{4},\frac{1}{16}\}\\). The image \\(I_\xi\\) is represented as 256 tokens when \\(\xi=\frac{1}{4}\\) and 64 tokens when \\(\xi=\frac{1}{16}\\). Notably, the reference model always performs inference with \\(\xi=\frac{1}{4}\\).
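
A toy rendering of such a consistency term under assumed tensor shapes (the function name and arguments are illustrative; the actual objective is the expectation written above):

```python
import torch
import torch.nn.functional as F

def vir_consistency_loss(student_logits: torch.Tensor, reference_logits: torch.Tensor) -> torch.Tensor:
    """KL(reference || student) between next-token distributions, averaged over the batch.

    student_logits: outputs when the image is compressed at the sampled rate xi (1/4 or 1/16);
    reference_logits: frozen reference outputs, always computed with xi = 1/4 (256 tokens).
    """
    log_p = F.log_softmax(student_logits, dim=-1)    # trainable model
    log_q = F.log_softmax(reference_logits, dim=-1)  # frozen reference
    return F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
```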
239
 
240
 
241
  `Router training`:
 
259
  \end{cases}
260
  $$
261
 
262
+ where \\(y_i^{\text{router}}=0\\) and \\(y_i^{\text{router}}=1\\) indicate that the compression rate \\(\xi\\) is set to \\(\tfrac{1}{16}\\) and \\(\tfrac{1}{4}\\), respectively.
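
For intuition only, a hypothetical rule for deriving these router targets from how much the stronger compression degrades the loss; the criterion and the threshold `tau` are assumptions, so please refer to the paper for the exact definition:

```python
def router_target(loss_quarter: float, loss_sixteenth: float, tau: float = 0.05) -> int:
    """0 -> compress to 1/16 (64 tokens); 1 -> keep 1/4 (256 tokens)."""
    degradation = (loss_sixteenth - loss_quarter) / max(abs(loss_quarter), 1e-8)
    return 0 if degradation < tau else 1
```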
263
 
264
  > Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.
265
 
 
280
 
281
  ### Decoupled Vision-Language Deployment
282
 
283
+ In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states. In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency.
284
  When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.
285
 
286
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/DvD.jpg)
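
The scheduling idea can be illustrated with a minimal producer/consumer toy in which threads and a bounded queue stand in for separate GPUs and their communication channel; this is a sketch of the concept, not the released serving stack:

```python
import queue
import threading
import time

feature_queue: queue.Queue = queue.Queue(maxsize=8)     # bounded buffer between the two "servers"

def vision_server(image_paths):
    # Stands in for the ViT side: embarrassingly parallel, batched, no sequential state.
    for path in image_paths:
        time.sleep(0.01)                                 # placeholder for the encoder forward pass
        feature_queue.put((path, f"features({path})"))
    feature_queue.put(None)                              # sentinel: no more images

def language_server():
    # Stands in for the LLM side: latency-bound autoregressive decoding, consuming features as they arrive.
    while (item := feature_queue.get()) is not None:
        path, feats = item
        print(f"decoding answer for {path} conditioned on {feats}")

producer = threading.Thread(target=vision_server, args=(["img0.jpg", "img1.jpg", "img2.jpg"],))
producer.start()
language_server()
producer.join()
```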
 
353
 
354
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_dvd.jpg)
355
 
356
+ ## Multimodal Datasets
357
 
358
+ - [MMPR](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.2): a large-scale, high-quality multimodal reasoning preference dataset.
359
+ - [MMPR-Tiny](https://huggingface.co/datasets/OpenGVLab/MMPR-Tiny): a small subset of 10,000 samples from MMPR v1.2, intended for testing and quick starts.
360
+ - [VisualPRM400K](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K): Training data used in VisualPRM, comprised of diverse synthetic and real-world images and descriptions.
361
+ - [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1): Training data used in MPO, comprised of both image-text pairs and annotated preference pairs.
362
+ - [InternVL-MMBench](https://huggingface.co/datasets/OpenGVLab/InternVL-MMBench): This dataset contains multi-round QA about detailed image content.
363
+ - [A Video Instruction Following Dataset for Long Video Understanding](https://huggingface.co/datasets/OpenGVLab/VideoInstructFollowing)
364
+ - [Ref-It Game Dataset](https://huggingface.co/datasets/OpenGVLab/Ref-It_game_dataset): a vision language task dataset.
365
+ - [VLEval](https://huggingface.co/datasets/OpenGVLab/VLEval): A Clean, Minimal, and Rich Tool for Evaluating Vision-Language Representation.
366
 
367
+ ## Quick Start with HuggingFace
368
+
369
+ We provide example code to run `InternVL3.5-8B-HF` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.
370
 
371
  > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.
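
For orientation, a minimal `transformers` sketch along the lines of the example referenced above; the image URL is a placeholder, and the `AutoProcessor`/`AutoModelForImageTextToText` path assumes the standard HF-format checkpoint interface, so treat the full Quick Start code in this card as authoritative:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "OpenGVLab/InternVL3_5-8B-HF"  # HF-format checkpoint referenced above
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
answer = processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```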
372
 
 
457
 
458
  Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
459
 
460
+ > [!TIP]
461
+ >
462
+ > If you want to fine-tune the InternVL3.5 models, please refer to https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat.
463
+
464
  ## Deployment
465
 
466
  ### LMDeploy
 
509
 
510
  images = [load_image(img_url) for img_url in image_urls]
511
  # Numbering images improves multi-image conversations
512
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
515
  print(response.text)
516
  ```
517
 
 
613
  journal={arXiv preprint arXiv:2508.18265},
614
  year={2025}
615
  }
616
+ ```