Weiyun1025 committed on
Commit 4cd1983 · verified · 1 Parent(s): b48dae5

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +4 -6
README.md CHANGED
@@ -17,7 +17,7 @@ tags:
17
 
18
  # InternVL3_5-GPT-OSS-20B-A4B-Preview
19
 
20
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](TBD)
21
 
22
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
23
 
@@ -27,9 +27,7 @@ tags:
27
 
28
  ## Introduction
29
 
30
- We introduce *InternVL3.5*, a new family of open-source multimodal models with a significant improvement in versatility, reasoning, and efficiency. InternVL3.5 is equipped with strong reasoning ability via a scalable reinforcement learning framework, termed *Cascade Reinforcement Learning (Cascade RL)*. Through an offline RL phase for efficient convergence and an online RL stage for distribution refinement, Cascade RL efficiently realizes a coarse-to-fine RL process and achieves significant gains for downstream reasoning tasks. To further improve inference efficiency, we introduce a *Visual Resolution Router (ViR)* that dynamically selects the trade-off resolution of visual tokens for MLLMs while maintaining original performance. Combining with ViR, the *Decoupled Vision-Language Deployment (DvD)* is adopted to deploy the vision encoder and the language model on separate GPUs to balance computational load.
31
- Benefiting from these innovations, InternVL3.5 achieves up to +18.3\% improvement in overall reasoning performance and 4.05 \\(\times\\) speedup in inference efficiency compared to its predecessor (i.e., InternVL3). In addition to these improvements, we have infused InternVL3.5 with a variety of new capabilities including GUI agent, embodied agent, etc.
32
- Specifically, InternVL3.5-241B-A28B achieves the highest overall score on multimodal general, reasoning, text, and agency tasks among leading open source MLLMs, and narrows the gap with top commercial models such as GPT-5.
33
 
34
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)
35
 
@@ -192,7 +190,7 @@ $$
192
 
193
  where the importance sampling ratio is defined as the geometric mean of the per-token ratios.
194
 
195
- > Please see [our paper](TBD) for more technical and experimental details.
196
 
197
 
198
  ### Visual Consistency Learning
@@ -242,7 +240,7 @@ $$
242
 
243
  where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\) indicate that the compression rate \(\xi\) is set to \(\tfrac{1}{16}\) and \(\tfrac{1}{4}\), respectively.
244
 
245
- > Please see [our paper](TBD) for more technical and experimental details.
246
 
247
 
248
  ### Test-Time Scaling
 
17
 
18
  # InternVL3_5-GPT-OSS-20B-A4B-Preview
19
 
20
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)
21
 
22
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
23
 
 
27
 
28
  ## Introduction
29
 
30
+ We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our *Decoupled Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
 
 
31
 
32
  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)
33
 
 
190
 
191
  where the importance sampling ratio is defined as the geometric mean of the per-token ratios.
192
 
193
+ > Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.
194
 
195
 
196
  ### Visual Consistency Learning
 
240
 
241
  where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\) indicate that the compression rate \(\xi\) is set to \(\tfrac{1}{16}\) and \(\tfrac{1}{4}\), respectively.
242
 
243
+ > Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.
244
 
245
 
246
  ### Test-Time Scaling