Upload README.md with huggingface_hub
README.md
CHANGED
@@ -16,7 +16,7 @@ tags:

# InternVL3_5-38B-Instruct

-
[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[InternVL3\]](https://huggingface.co/papers/2504.10479) [\[InternVL3.5\]](
+
[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[InternVL3\]](https://huggingface.co/papers/2504.10479) [\[InternVL3.5\]](https://huggingface.co/papers/2508.18265)

[\[Blog\]](https://internvl.github.io/blog/) [\[Chat Demo\]](https://chat.intern-ai.org.cn/) [\[Quick Start\]](#quick-start) [\[Documents\]](https://internvl.readthedocs.io/en/latest/)
@@ -26,9 +26,7 @@ tags:

## Introduction

-
We introduce *InternVL3.5*, a new family of open-source multimodal
-
Benefiting from these innovations, InternVL3.5 achieves up to +18.3\% improvement in overall reasoning performance and 4.05 \\(\times\\) speedup in inference efficiency compared to its predecessor (i.e., InternVL3). In addition to these improvements, we have infused InternVL3.5 with a variety of new capabilities including GUI agent, embodied agent, etc.
-
Specifically, InternVL3.5-241B-A28B achieves the highest overall score on multimodal general, reasoning, text, and agency tasks among leading open source MLLMs, and narrows the gap with top commercial models such as GPT-5.
+
We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our *Decoupled Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

![image](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3.5/overall-performance.jpg)
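The introduction above describes Decoupled Vision-Language Deployment (DvD) only at a high level: the vision encoder and the language model are served on different GPUs so that visual encoding and language decoding do not compete for the same device. The sketch below illustrates that handoff with stand-in modules; the device names, shapes, and module choices are assumptions for illustration, not the InternVL3.5 deployment code.

```python
import torch
import torch.nn as nn

# Stand-in modules only; a real DvD setup would place the InternVL3.5 ViT and LLM here.
vision_device = "cuda:0" if torch.cuda.device_count() >= 1 else "cpu"
llm_device = "cuda:1" if torch.cuda.device_count() >= 2 else "cpu"

vision_encoder = nn.Linear(3 * 14 * 14, 4096).to(vision_device)  # placeholder vision tower
language_model = nn.Linear(4096, 32000).to(llm_device)           # placeholder language head

patches = torch.randn(256, 3 * 14 * 14, device=vision_device)    # flattened image patches
with torch.no_grad():
    visual_tokens = vision_encoder(patches)        # vision-side compute stays on its GPU
    visual_tokens = visual_tokens.to(llm_device)   # only the embeddings cross devices
    logits = language_model(visual_tokens)         # language-side compute on the other GPU
print(logits.shape)  # torch.Size([256, 32000])
```

The intended benefit is that only the visual embeddings cross devices, so each side can be batched and scaled for its own workload.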
@@ -191,7 +189,7 @@ $$

where the importance sampling ratio is defined as the geometric mean of the per-token ratios.

-
> Please see [our paper](
+
> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

### Visual Consistency Learning
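For a concrete handle on the sentence above: the geometric mean of the per-token importance ratios equals the exponential of the average per-token log-ratio, so it can be computed directly from per-token log-probabilities. Below is a minimal PyTorch sketch of that computation; the tensor names and masking convention are illustrative assumptions, not the released training code.

```python
import torch

def sequence_importance_ratio(new_logprobs: torch.Tensor,
                              old_logprobs: torch.Tensor,
                              mask: torch.Tensor) -> torch.Tensor:
    """Geometric mean over response tokens of pi_new(y_t) / pi_old(y_t).

    new_logprobs, old_logprobs: [batch, seq_len] log-probs of the sampled tokens
    under the current and behaviour policies; mask: 1.0 for response tokens, 0.0 otherwise.
    """
    log_ratio = (new_logprobs - old_logprobs) * mask                   # per-token log-ratios
    mean_log_ratio = log_ratio.sum(-1) / mask.sum(-1).clamp(min=1.0)   # mean over valid tokens
    return mean_log_ratio.exp()                                        # exp(mean log-ratio) = geometric mean

# Tiny usage example: batch of 2 responses, the second with one padded position.
new_lp, old_lp = torch.randn(2, 4), torch.randn(2, 4)
mask = torch.tensor([[1., 1., 1., 1.], [1., 1., 1., 0.]])
print(sequence_importance_ratio(new_lp, old_lp, mask))
```

In the full objective this sequence-level ratio would typically be clipped and weighted by an advantage estimate; the exact formulation is given in the paper linked above.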
@@ -241,7 +239,7 @@ $$

where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\) indicate that the compression rate \(\xi\) is set to \(\tfrac{1}{16}\) and \(\tfrac{1}{4}\), respectively.

-
> Please see [our paper](
+
> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

### Test-Time Scaling
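The routing rule above is simple enough to state as code. The helper below only illustrates how a binary router decision maps to the compression rate \(\xi\) and the resulting visual-token budget; the function name and token counts are hypothetical, not part of the InternVL3.5 API.

```python
def select_compression(y_router: int, num_patch_tokens: int) -> tuple[float, int]:
    """Map the ViR router's binary output to a compression rate xi and a token count.

    y_router == 0 -> xi = 1/16 (stronger compression); y_router == 1 -> xi = 1/4.
    Illustrative sketch, not the released implementation.
    """
    xi = 1 / 16 if y_router == 0 else 1 / 4
    return xi, max(1, round(num_patch_tokens * xi))

# Example: a tile represented by 1024 patch tokens.
print(select_compression(0, 1024))  # (0.0625, 64)  -> heavily compressed tile
print(select_compression(1, 1024))  # (0.25, 256)   -> lightly compressed tile
```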