Commit 68e8dbc by Weiyun1025 · verified · 1 Parent(s): a7a87e7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +4 -6
README.md CHANGED
@@ -16,7 +16,7 @@ tags:
16
 
17
  # InternVL3_5-38B-Instruct
18
 
19
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](TBD)
20
 
21
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
22
 
@@ -26,9 +26,7 @@ tags:
26
 
27
  ## Introduction
28
 
29
- We introduce *InternVL3.5*, a new family of open-source multimodal models with a significant improvement in versatility, reasoning, and efficiency. InternVL3.5 is equipped with strong reasoning ability via a scalable reinforcement learning framework, termed *Cascade Reinforcement Learning (Cascade RL)*. Through an offline RL phase for efficient convergence and an online RL stage for distribution refinement, Cascade RL efficiently realizes a coarse-to-fine RL process and achieves significant gains for downstream reasoning tasks. To further improve inference efficiency, we introduce a *Visual Resolution Router (ViR)* that dynamically selects the trade-off resolution of visual tokens for MLLMs while maintaining original performance. Combining with ViR, the *Decoupled Vision-Language Deployment (DvD)* is adopted to deploy the vision encoder and the language model on separate GPUs to balance computational load.
30
- Benefiting from these innovations, InternVL3.5 achieves up to +18.3\% improvement in overall reasoning performance and 4.05 \\(\times\\) speedup in inference efficiency compared to its predecessor (i.e., InternVL3). In addition to these improvements, we have infused InternVL3.5 with a variety of new capabilities including GUI agent, embodied agent, etc.
31
- Specifically, InternVL3.5-241B-A28B achieves the highest overall score on multimodal general, reasoning, text, and agency tasks among leading open source MLLMs, and narrows the gap with top commercial models such as GPT-5.
32
 
33
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)
34
 
@@ -191,7 +189,7 @@ $$
191
 
192
  where the importance sampling ratio is defined as the geometric mean of the per-token ratios.
193
 
194
- > Please see [our paper](TBD) for more technical and experimental details.
195
 
196
 
197
  ### Visual Consistency Learning
@@ -241,7 +239,7 @@ $$
241
 
242
  where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\) indicate that the compression rate \(\xi\) is set to \(\tfrac{1}{16}\) and \(\tfrac{1}{4}\), respectively.
243
 
244
- > Please see [our paper](TBD) for more technical and experimental details.
245
 
246
 
247
  ### Test-Time Scaling
 
16
 
17
  # InternVL3_5-38B-Instruct
18
 
19
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)
20
 
21
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
22
 
 
26
 
27
  ## Introduction
28
 
29
+ We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our *Decoupled Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
 
 
30
 
31
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)
32
 
 
189
 
190
  where the importance sampling ratio is defined as the geometric mean of the per-token ratios.
191
 
192
+ > Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.
193
 
194
 
195
  ### Visual Consistency Learning
 
239
 
240
  where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\) indicate that the compression rate \(\xi\) is set to \(\tfrac{1}{16}\) and \(\tfrac{1}{4}\), respectively.
241
 
242
+ > Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.
243
 
244
 
245
  ### Test-Time Scaling