Weiyun1025 committed (verified)
Commit 6a10ffa · Parent: fab0633

Upload README.md with huggingface_hub

Files changed (1): README.md (+32 -32)
README.md CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
  pipeline_tag: image-text-to-text
  library_name: transformers
  base_model:
- - InternVL3_5-241B-A28B-MPO
+ - OpenGVLab/InternVL3_5-241B-A28B-MPO
  base_model_relation: finetune
  datasets:
  - OpenGVLab/MMPR-v1.2
@@ -30,7 +30,7 @@ We introduce *InternVL3.5*, a new family of open-source multimodal models with
  Benefiting from these innovations, InternVL3.5 achieves up to a +18.3\% improvement in overall reasoning performance and a 4.05 \\(\times\\) speedup in inference efficiency compared to its predecessor, InternVL3. In addition, InternVL3.5 gains a variety of new capabilities, including GUI and embodied agency.
  Specifically, InternVL3.5-241B-A28B achieves the highest overall score on multimodal general, reasoning, text, and agentic tasks among leading open-source MLLMs, and narrows the gap with top commercial models such as GPT-5.

- ![image/jpg](images/performance.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)

  > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

@@ -44,27 +44,27 @@ To maintain consistency with earlier generations, we provide two model formats:
  > If you want to convert a checkpoint between these two formats, please refer to the [custom2hf](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/tools/internvl_custom2hf.py) and [hf2custom](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/tools/internvl_hf2custom.py) scripts.


- | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
- | --------------------- | ------------- | --------------- | ------------ | ---------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
- | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-1B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-1B) |
- | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-2B) |
- | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-4B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-4B) |
- | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-8B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-8B) |
- | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-14B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-14B) |
- | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-38B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-38B) |
- | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-20B-A3B-Preview) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-20B-A4B-Preview) |
- | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-30B-A3B) |
- | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A29B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-241B-A28B) |
+ | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
+ | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
+ | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-1B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-1B) |
+ | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-2B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-2B) |
+ | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-4B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-4B) |
+ | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-8B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-8B) |
+ | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-14B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-14B) |
+ | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-38B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-38B) |
+ | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview) |
+ | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-30B-A3B) |
+ | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A29B | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B) | [🤖 link](https://www.modelscope.cn/models/OpenGVLab/InternVL3_5-241B-A28B) |


- ![image/jpg](images/performance_overall.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_overall.jpg)

  > We conduct the evaluation with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). ***To enable the Thinking mode of our model, please set the system prompt to [R1_SYSTEM_PROMPT](https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/vlm/internvl/internvl_chat.py#L38).*** When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

  Our training pipeline comprises four stages: Multimodal Continual Pre-Training (**CPT**), Supervised Fine-Tuning (**SFT**), and two-stage Cascade Reinforcement Learning (**CascadeRL**). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (**MPO**) under an offline RL setting, followed by **GSPO** under an online RL setting.
  For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (**ViCO**), which reduces the token cost required to represent an image patch.

- ![image/jpg](images/training_pipeline.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/training_pipeline.jpg)

  Here, we also open-source the model weights after different training stages for potential research usage.
  ***If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.***
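
Editor's note: the evaluation remark in the hunk above recommends enabling Thinking mode by setting the system prompt to R1_SYSTEM_PROMPT and sampling with `do_sample=True` and `temperature=0.6`. Below is a minimal sketch of how those settings might be applied through the generic `transformers` image-text-to-text pipeline; the model choice, message layout, and placeholder prompt string are illustrative assumptions, not part of this commit, and depending on the checkpoint format the HF-format repo or `trust_remote_code=True` may be needed.

```python
from transformers import pipeline

# Placeholder: copy the actual thinking-mode system prompt from the
# R1_SYSTEM_PROMPT constant linked in the note above (VLMEvalKit).
R1_SYSTEM_PROMPT = "..."

# Illustrative model choice; any InternVL3.5 checkpoint from the table could be used.
pipe = pipeline(
    "image-text-to-text",
    model="OpenGVLab/InternVL3_5-8B",
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": [{"type": "text", "text": R1_SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/problem.png"},
            {"type": "text", "text": "Solve the problem in the image step by step."},
        ],
    },
]

# Sampling settings recommended above to mitigate repetition in Thinking mode.
out = pipe(
    text=messages,
    generate_kwargs={"do_sample": True, "temperature": 0.6, "max_new_tokens": 2048},
)
# "generated_text" holds the reply (appended to the conversation for chat-style input).
print(out[0]["generated_text"])
```

For the GitHub-format checkpoints, the same system prompt and sampling settings would instead be passed through the chat interface shipped with those repositories.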
@@ -126,7 +126,7 @@ For each patch, the patch router determines the appropriate compression rate by
  Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.


- ![image/jpg](images/architecture.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)

  ## Training and Deployment Strategy

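
Editor's note: the hunk above sits in the section describing InternVL3.5-Flash's patch router, which picks a per-patch compression rate so that roughly half of the visual tokens can be dropped with little performance loss. A toy sketch of that routing idea follows; the layer sizes, the two compression ratios, and the pooling-based "compression" are assumptions for illustration, not the released ViCO implementation.

```python
import torch
import torch.nn as nn


class PatchRouter(nn.Module):
    """Toy router: score each image patch and pick a compression rate for it."""

    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        # Two illustrative choices: keep 1/4 of the tokens or 1/16 of them.
        self.scorer = nn.Linear(hidden_size, 2)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, tokens_per_patch, hidden_size)
        pooled = patch_feats.mean(dim=1)
        return self.scorer(pooled).argmax(dim=-1)  # 0 -> mild, 1 -> aggressive


def compress_patches(patch_feats: torch.Tensor, choices: torch.Tensor) -> list[torch.Tensor]:
    """Hypothetical compression: average-pool groups of visual tokens per patch."""
    compressed = []
    for feats, choice in zip(patch_feats, choices):
        group = 4 if choice.item() == 0 else 16
        tokens, dim = feats.shape
        compressed.append(feats.view(tokens // group, group, dim).mean(dim=1))
    return compressed


router = PatchRouter()
feats = torch.randn(6, 256, 1024)            # 6 patches, 256 visual tokens each
routed = compress_patches(feats, router(feats))
print([t.shape[0] for t in routed])           # remaining tokens per patch, e.g. 64 or 16
```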
@@ -263,7 +263,7 @@ This approach improves reasoning breadth.
  In multimodal inference, the vision encoder and the language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states. In contrast, the language model performs inference autoregressively, requiring previous states to compute the next token. This sequential property makes the language part more sensitive to memory bandwidth and latency.
  When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

- ![image/jpg](images/DvD.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/DvD.jpg)

  As shown in the figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.
  In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.
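
Editor's note: the DvD description above (BF16 features pushed one way over TCP, with vision encoding, transmission, and language prefill overlapped as a three-stage pipeline) can be pictured with the following sketch of the vision-server side. All names, the wire format, and the use of plain `queue`/`threading` are assumptions made for illustration; the system described in the hunk additionally supports RDMA and runs inside a real serving stack.

```python
import queue
import socket
import threading

import torch

# Bounded stage queues: a slow stage applies back-pressure instead of stalling the
# whole system, which is the point of decoupling vision from language processing.
feature_q: "queue.Queue" = queue.Queue(maxsize=4)
message_q: "queue.Queue" = queue.Queue(maxsize=4)


def vision_stage(vision_model, image_batches):
    """Stage 1: run the vision stack (e.g. ViT + MLP) on batched images."""
    for batch in image_batches:
        with torch.no_grad():
            feats = vision_model(batch).to(torch.bfloat16).cpu()
        feature_q.put(feats)
    feature_q.put(None)  # sentinel: no more batches


def serialize_stage():
    """Stage 2: pack BF16 features into length-prefixed byte messages."""
    while (feats := feature_q.get()) is not None:
        payload = feats.contiguous().view(torch.uint8).numpy().tobytes()
        message_q.put(len(payload).to_bytes(8, "big") + payload)
    message_q.put(None)


def transmit_stage(host: str, port: int):
    """Stage 3: unidirectional TCP push to the language server, which fuses the
    features with the text context before prefilling and decoding."""
    with socket.create_connection((host, port)) as conn:
        while (msg := message_q.get()) is not None:
            conn.sendall(msg)


def run_pipeline(vision_model, image_batches, host="language-server", port=9000):
    # The three stages run concurrently, so encoding, transmission, and the remote
    # prefill overlap rather than serializing behind one another.
    stages = [
        threading.Thread(target=vision_stage, args=(vision_model, image_batches)),
        threading.Thread(target=serialize_stage),
        threading.Thread(target=transmit_stage, args=(host, port)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
```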
@@ -276,62 +276,62 @@ DvD increases GPU utilization and processing efficiency on the vision side, whil

  ### Multimodal Reasoning and Mathematics

- ![image/jpg](images/performance_reasoning.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_reasoning.jpg)

  ### OCR, Chart, and Document Understanding

- ![image/jpg](images/performance_ocr.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_ocr.jpg)

  ### Multi-Image Understanding & Real-World Comprehension

- ![image/jpg](images/performance_multi_images.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_multi_images.jpg)

  ### Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

- ![image/jpg](images/performance_comprehensive.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_comprehensive.jpg)

  ### Visual Grounding

- ![image/jpg](images/performance_grounding.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_grounding.jpg)

  ### Multimodal Multilingual Understanding

- ![image/jpg](images/performance_multilingual.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_multilingual.jpg)

  ### Video Understanding

- ![image/jpg](images/performance_video.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_video.jpg)

  ### GUI Tasks

- ![image/jpg](images/performance_gui.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_gui.jpg)

  ### Embodied Tasks

- ![image/jpg](images/performance_embody.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_embody.jpg)

  ### SVG Tasks

- ![image/jpg](images/performance_svg.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_svg.jpg)

- ![image/jpg](images/performance_svg_gen.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_svg_gen.jpg)

  ## Evaluation on Language Capability

- ![image/jpg](images/performance_text.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance_text.jpg)

  ## Ablation Study

  ### Cascade Reinforcement Learning

- ![image/jpg](images/ablation_cascade_rl.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_cascade_rl.jpg)

- ![image/jpg](images/ablation_cascade_rl_table.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_cascade_rl_table.jpg)

  ### Decoupled Vision-Language Deployment


- ![image/jpg](images/ablation_dvd.jpg)
+ ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_dvd.jpg)

  ## Quick Start
 