Improve InternVL3_5-4B Model Card
This PR improves the model card for InternVL3_5-4B by:
- Adding a prominent link to the paper "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency".
- Organizing the "Quick Start" and "Deployment" sections for better clarity, separating `transformers` usage from deployment solutions like `LMDeploy`.
- Adding a new "Multimodal Datasets" section to highlight relevant datasets used by the model.
- Adding a helpful tip for finetuning these models.
The metadata (`pipeline_tag`, `library_name`, `license`) remains unchanged as it is already correct and comprehensive.
README.md CHANGED

@@ -1,22 +1,24 @@
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
base_model_relation: finetune
datasets:
language:
tags:
---

# InternVL3_5-4B

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

@@ -27,13 +29,13 @@ tags:

## Introduction

We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.
See [quick start](#quick-start) for how to use our model.

## InternVL3.5 Family

@@ -137,11 +139,11 @@ The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained

`InternVL3.5-Flash`:
Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router (ViR)*, thus yielding a series of
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50


@@ -156,8 +158,8 @@
\mathcal{L}_{i}=-\log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right),
$$

where \\(x_i\\) is the predicted token and
Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss

$$
\mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}},

@@ -167,20 +169,20 @@ where \\(N\\) denotes the number of tokens in the training sample on which the l

### Supervised Fine-Tuning

During the SFT phase, we adopt the same objective as in the pre-training stage and use the
Compared to InternVL3, the SFT stage of InternVL3.5 contains
(1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.
(2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.
(3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect

### Cascade Reinforcement Learning

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner.
Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage.
Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself.

@@ -233,7 +235,7 @@
\Bigg],
$$

where \\(\mathrm{KL}

`Router training`:

@@ -257,7 +259,7 @@ y_i^\text{router} =
\end{cases}
$$

where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\)
> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

@@ -278,7 +280,7 @@ This approach improves reasoning breadth.

### Decoupled Vision-Language Deployment
In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state.
When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.


@@ -351,11 +353,20 @@ DvD increases GPU utilization and processing efficiency on the vision side, whil


> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

@@ -446,6 +457,10 @@ The HuggingFace format checkpoints of our models are fully consistent with the A

Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.

## Deployment

### LMDeploy

@@ -494,7 +509,9 @@ image_urls=[

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}
print(response.text)
```

@@ -596,4 +613,4 @@ If you find this project useful in your research, please consider citing:
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}
```

---
base_model:
- OpenGVLab/InternVL3_5-4B-MPO
datasets:
- OpenGVLab/MMPR-v1.2
- OpenGVLab/MMPR-Tiny
language:
- multilingual
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- internvl
- custom_code
base_model_relation: finetune
---

# InternVL3_5-4B

This repository contains the InternVL3_5-4B model presented in the paper [InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://huggingface.co/papers/2508.18265).
[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

@@ -27,13 +29,13 @@ tags:

## Introduction

We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL) framework*, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.
See [quick start](#quick-start-with-huggingface) for how to use our model.

## InternVL3.5 Family

@@ -137,11 +139,11 @@ The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained

`InternVL3.5-Flash`:
Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router (ViR)*, thus yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.
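To make the token arithmetic above concrete, the following is a minimal, shape-level sketch of a pixel-shuffle-style merge. It is only an illustration of the two compression rates, not the released ViR or pixel shuffle module, and the tensor sizes are assumptions chosen to match the 1024 -> 256 / 64 numbers.

```python
import torch

def pixel_shuffle_compress(tokens: torch.Tensor, r: int) -> torch.Tensor:
    # tokens: (batch, N, C) patch tokens laid out on a sqrt(N) x sqrt(N) grid.
    # Merging r x r neighbouring tokens stacks their channels into one token:
    # r=2 turns 1024 tokens into 256, r=4 turns 1024 tokens into 64.
    b, n, c = tokens.shape
    h = w = int(n ** 0.5)
    x = tokens.view(b, h // r, r, w // r, r, c)       # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()      # group the tokens of each block together
    return x.view(b, (h // r) * (w // r), r * r * c)  # one merged token per block

vit_tokens = torch.randn(1, 1024, 1024)               # 1024 tokens per 448x448 tile (channel dim is illustrative)
print(pixel_shuffle_compress(vit_tokens, r=2).shape)  # torch.Size([1, 256, 4096])
print(pixel_shuffle_compress(vit_tokens, r=4).shape)  # torch.Size([1, 64, 16384])
```

In the actual model the merged tokens also pass through a projector before reaching the LLM; the sketch only shows the token-count bookkeeping.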


@@ -156,8 +158,8 @@
\mathcal{L}_{i}=-\log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right),
$$

where \\(x_i\\) is the predicted token and prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss.
Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square averaging to re-weight the NTP loss as follows:

$$
\mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}},
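Read literally, the two equations above amount to weighting each token's NTP loss by \\(w_i = 1/N^{0.5}\\) and normalizing by the sum of weights. The snippet below is a small illustration of that computation, not the authors' training code; it assumes `labels` are already shifted for next-token prediction and that non-response tokens are masked with `-100`.

```python
import torch
import torch.nn.functional as F

def square_averaged_ntp_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), shifted, with
    # non-response tokens set to ignore_index so they do not contribute to the loss.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).reshape(labels.shape)

    mask = (labels != ignore_index).float()
    n_tokens = mask.sum(dim=1, keepdim=True).clamp(min=1)  # N: response tokens per sample
    weights = mask / n_tokens.sqrt()                       # w_i = 1 / N^0.5
    return (weights * per_token).sum() / weights.sum()     # L'_i = w_i / sum_j w_j * L_i
```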

@@ -167,20 +169,20 @@ where \\(N\\) denotes the number of tokens in the training sample on which the l

### Supervised Fine-Tuning

During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources:
(1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.
(2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.

(3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

### Cascade Reinforcement Learning

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner.
Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which can guarantee high-quality rollouts for the later online stage.
Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost.

@@ -233,7 +235,7 @@
\Bigg],
$$

where \\(\mathrm{KL}\\) denotes the KL divergence and \\(\xi\\) denotes the compression rate, which is uniformly sampled from \\(\{\frac{1}{4},\frac{1}{16}\}\\). The image \\(I_\xi\\) is represented as 256 tokens when \\(\xi=\frac{1}{4}\\) and 64 tokens when \\(\xi=\frac{1}{16}\\). Notably, the reference model always performs inference with \\(\xi=\frac{1}{4}\\).
`Router training`:

@@ -257,7 +259,7 @@ y_i^\text{router} =
\end{cases}
$$

where \\(y_i^{\text{router}}=0\\) and \\(y_i^{\text{router}}=1\\) indicate that the compression rate \\(\xi\\) is set to \\(\tfrac{1}{16}\\) and \\(\tfrac{1}{4}\\), respectively.
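In other words, the router makes a per-patch binary choice between the two compression rates. A toy reading of that rule is sketched below, assuming, as the Flash description above suggests, that semantically richer patches keep the lower compression rate; the threshold and helper name are invented for the example.

```python
def route_patch(router_score: float, threshold: float = 0.5):
    # router_score: probability that the patch is semantically rich (y = 1).
    if router_score >= threshold:   # y = 1
        return 1 / 4, 256           # xi = 1/4  -> 256 visual tokens for this patch
    return 1 / 16, 64               # y = 0: xi = 1/16 -> 64 visual tokens

print(route_patch(0.9))  # (0.25, 256)
print(route_patch(0.2))  # (0.0625, 64)
```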
> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

@@ -278,7 +280,7 @@ This approach improves reasoning breadth.

### Decoupled Vision-Language Deployment

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state. In contrast, the language model adopts the inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency.
When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.
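The toy script below only illustrates the separation idea, a vision worker and a language worker on different devices exchanging features through a queue. It is not the DvD implementation; the stand-in modules, shapes, and device choices are invented for the example.

```python
import torch
import torch.multiprocessing as mp

def vision_worker(patch_q, feat_q, device):
    encoder = torch.nn.Linear(1024, 4096).to(device)   # stand-in for the ViT + projector
    while (patches := patch_q.get()) is not None:       # (1, num_tokens, 1024) per request
        with torch.no_grad():
            feat_q.put(encoder(patches.to(device)).cpu())
    feat_q.put(None)                                     # propagate end-of-stream

def language_worker(feat_q, device):
    while (feats := feat_q.get()) is not None:
        # A real server would prefill the LLM with these features and decode autoregressively.
        print("language worker on", device, "received features of shape", tuple(feats.shape))

if __name__ == "__main__":
    mp.set_start_method("spawn")
    vis_dev = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
    llm_dev = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"
    patch_q, feat_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=vision_worker, args=(patch_q, feat_q, vis_dev)),
             mp.Process(target=language_worker, args=(feat_q, llm_dev))]
    for p in procs:
        p.start()
    patch_q.put(torch.randn(1, 1024, 1024))              # one fake request
    patch_q.put(None)
    for p in procs:
        p.join()
```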


@@ -351,11 +353,20 @@ DvD increases GPU utilization and processing efficiency on the vision side, whil



## Multimodal Datasets

- [MMPR](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.2): Multimodal Progression for Reasoning, 400K high-quality data items.
- [MMPR-Tiny](https://huggingface.co/datasets/OpenGVLab/MMPR-Tiny): a small subset of MMPR v1.2 containing 10,000 samples, intended for testing and quick starts.
- [VisualPRM400K](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K): Training data used in VisualPRM, comprised of diverse synthetic and real-world images and descriptions.
- [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1): Training data used in MPO, comprised of both image-text pairs and annotated preference pairs.
- [InternVL-MMBench](https://huggingface.co/datasets/OpenGVLab/InternVL-MMBench): This dataset contains multi-round QA about detailed image content.
- [A Video Instruction Following Dataset for Long Video Understanding](https://huggingface.co/datasets/OpenGVLab/VideoInstructFollowing)
- [Ref-It Game Dataset](https://huggingface.co/datasets/OpenGVLab/Ref-It_game_dataset): a vision language task dataset.
- [VLEval](https://huggingface.co/datasets/OpenGVLab/VLEval): A Clean, Minimal, and Rich Tool for Evaluating Vision-Language Representation.
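For a quick look at one of the datasets listed above, the usual `datasets` API applies; the split and printed fields below are assumptions, so check the individual dataset cards for the actual schema.

```python
from datasets import load_dataset

# MMPR-Tiny is the small subset intended for quick experiments; "train" is an assumed split name.
ds = load_dataset("OpenGVLab/MMPR-Tiny", split="train")
print(len(ds), ds.column_names)
print(ds[0])
```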

## Quick Start with HuggingFace

We provide example code to run `InternVL3.5-8B-HF` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.
> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.
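A condensed sketch of the `transformers` path is shown below; the complete example lives in the unchanged part of the README, and the checkpoint id, image URL, and generation settings here are placeholders rather than recommendations.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "OpenGVLab/InternVL3_5-8B-HF"  # placeholder: pick the checkpoint you actually want
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```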

@@ -446,6 +457,10 @@ The HuggingFace format checkpoints of our models are fully consistent with the A

Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.

> [!TIP]
>
> If you want to fine-tune the InternVL3.5 models, please refer to https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat.

## Deployment

### LMDeploy

@@ -494,7 +509,9 @@ image_urls=[

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```

@@ -596,4 +613,4 @@ If you find this project useful in your research, please consider citing:
journal={arXiv preprint arXiv:2508.18265},
year={2025}
}
```