Upload README.md with huggingface_hub
README.md CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
-- InternVL3_5-241B-A28B-MPO
+- OpenGVLab/InternVL3_5-241B-A28B-MPO
 base_model_relation: finetune
 datasets:
 - OpenGVLab/MMPR-v1.2
@@ -30,7 +30,7 @@ We introduce *InternVL3.5*, a new family of open-source multimodal models with
 Benefiting from these innovations, InternVL3.5 achieves up to +18.3\% improvement in overall reasoning performance and a 4.05 \\(\times\\) speedup in inference efficiency compared to its predecessor (i.e., InternVL3). In addition to these improvements, we have infused InternVL3.5 with a variety of new capabilities, including GUI agent, embodied agent, etc.
 Specifically, InternVL3.5-241B-A28B achieves the highest overall score on multimodal general, reasoning, text, and agentic tasks among leading open-source MLLMs, and narrows the gap with top commercial models such as GPT-5.

-
+

 > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

@@ -44,27 +44,27 @@ To maintain consistency with earlier generations, we provide two model formats:
 > If you want to convert the checkpoint between these two formats, please refer to the [custom2hf](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/tools/internvl_custom2hf.py) and [hf2custom](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/tools/internvl_hf2custom.py) scripts.


-| Model | #Vision Param | #Language Param | #Total Param | HF Link
-| --------------------- | ------------- | --------------- | ------------ |
-| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-1B)
-| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-2B)
-| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-4B)
-| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-8B)
-| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-14B)
-| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-38B)
-| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-20B-
-| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B)
-| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A29B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B)
+| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
+| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
+| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-1B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-1B) |
+| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-2B) |
+| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-4B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-4B) |
+| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-8B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-8B) |
+| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-14B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-14B) |
+| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-38B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-38B) |
+| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview) |
+| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-30B-A3B) |
+| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A29B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-241B-A28B) |


-
+

 > We conduct the evaluation with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). ***To enable the Thinking mode of our model, please set the system prompt to [R1_SYSTEM_PROMPT](https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/vlm/internvl/internvl_chat.py#L38).*** When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

 Our training pipeline comprises three stages: Multimodal Continual Pre-Training (**CPT**), Supervised Fine-Tuning (**SFT**), and Cascade Reinforcement Learning (**CascadeRL**). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (**MPO**) under an offline RL setting, followed by **GSPO** under an online RL setting.
 For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (**ViCO**), which reduces the token cost required to represent an image patch.

-
+

 Here, we also open-source the model weights after different training stages for potential research usage.
 ***If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.***
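As a reference for the Thinking-mode note above, the snippet below is a minimal sketch of the recommended settings (`do_sample=True`, `temperature=0.6`) combined with the R1 system prompt. It assumes the InternVL-style remote-code interface (`AutoModel` with `trust_remote_code=True`, a `system_message` attribute, and `model.chat(...)`); verify the exact interface against the model card's Quick Start, and copy the prompt text verbatim from the linked VLMEvalKit file.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3_5-241B-A28B"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Copy the prompt text verbatim from the linked internvl_chat.py; elided here.
R1_SYSTEM_PROMPT = "..."
model.system_message = R1_SYSTEM_PROMPT  # assumed attribute of the remote-code chat model

# Recommended sampling settings when Thinking mode is enabled.
generation_config = dict(max_new_tokens=1024, do_sample=True, temperature=0.6)

# Pure-text query (pixel_values=None); prepare pixel_values as in the Quick Start for image inputs.
response = model.chat(tokenizer, None, "Explain your reasoning: what is 17 * 24?", generation_config)
print(response)
```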
@@ -126,7 +126,7 @@ For each patch, the patch router determines the appropriate compression rate by
 Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.


-
+

 ## Training and Deployment Strategy

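To make the patch-aware compression idea concrete, here is a toy patch router in the spirit of the description above: it scores each image patch and pools its visual tokens at the chosen rate. The candidate rates, the routing rule, and the pooling operator are illustrative assumptions, not the released ViCO implementation.

```python
import torch
import torch.nn as nn

class ToyPatchRouter(nn.Module):
    """Picks a per-patch compression rate and average-pools tokens accordingly (illustrative only)."""

    def __init__(self, dim: int, rates=(4, 16)):
        super().__init__()
        self.rates = rates
        self.score = nn.Linear(dim, len(rates))  # per-patch logits over candidate rates

    def forward(self, patch_tokens: torch.Tensor) -> list:
        # patch_tokens: (num_patches, tokens_per_patch, dim)
        choice = self.score(patch_tokens.mean(dim=1)).argmax(dim=-1)  # (num_patches,)
        compressed = []
        for patch, rate_idx in zip(patch_tokens, choice):
            r = self.rates[int(rate_idx)]
            t, d = patch.shape
            # Represent the patch with t // r tokens by averaging groups of r tokens.
            compressed.append(patch[: t - t % r].reshape(-1, r, d).mean(dim=1))
        return compressed

router = ToyPatchRouter(dim=1024)
patches = torch.randn(6, 64, 1024)            # 6 patches, 64 visual tokens each
print([x.shape[0] for x in router(patches)])  # token count per patch after compression
```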
@@ -263,7 +263,7 @@ This approach improves reasoning breadth.
 In multimodal inference, the vision encoder and the language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not depend on long-term history states. In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency.
 When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

-
+

 As shown in the figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.
 In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speeds. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.
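The asynchronous three-stage pipeline described above (vision encoding, unidirectional feature transmission, language prefill/decoding) can be pictured with the self-contained sketch below. It is a conceptual illustration only: the stage bodies, feature shapes, and in-process queues stand in for the actual vision/language servers and the TCP/RDMA transport.

```python
import asyncio
import numpy as np

async def vision_stage(requests: asyncio.Queue, features: asyncio.Queue) -> None:
    # ViT + MLP (and ViR for the Flash variant) would run here on the vision server.
    while (req := await requests.get()) is not None:
        feats = np.random.randn(req["num_patches"], 4096).astype(np.float16)  # stand-in for BF16 features
        await features.put({"id": req["id"], "feats": feats})
    await features.put(None)

async def transmit_stage(features: asyncio.Queue, language_in: asyncio.Queue) -> None:
    # Unidirectional transfer of visual features (TCP, optionally RDMA, in the real system).
    while (item := await features.get()) is not None:
        await language_in.put({"id": item["id"], "payload": item["feats"].tobytes()})
    await language_in.put(None)

async def language_stage(language_in: asyncio.Queue) -> None:
    # The language server fuses the features with the text context and decodes autoregressively.
    while (item := await language_in.get()) is not None:
        print(f"request {item['id']}: received {len(item['payload'])} bytes of visual features")

async def main() -> None:
    requests, features, language_in = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for i in range(4):
        requests.put_nowait({"id": i, "num_patches": 256})
    requests.put_nowait(None)  # sentinel shuts the pipeline down
    # The three stages run concurrently, so vision, transmission, and prefill overlap.
    await asyncio.gather(
        vision_stage(requests, features),
        transmit_stage(features, language_in),
        language_stage(language_in),
    )

asyncio.run(main())
```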
@@ -276,62 +276,62 @@ DvD increases GPU utilization and processing efficiency on the vision side, whil

 ### Multimodal Reasoning and Mathematics

-
+

 ### OCR, Chart, and Document Understanding

-
+

 ### Multi-Image Understanding & Real-World Comprehension

-
+

 ### Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

-
+

 ### Visual Grounding

-
+

 ### Multimodal Multilingual Understanding

-
+

 ### Video Understanding

-
+

 ### GUI Tasks

-
+

 ### Embodied Tasks

-
+

 ### SVG Tasks

-
+

-
+

 ## Evaluation on Language Capability

-
+

 ## Ablation Study

 ### Cascade Reinforcement Learning

-
+

-
+

 ### Decoupled Vision-Language Deployment


-
+

 ## Quick Start
