Upload README.md with huggingface_hub
README.md
CHANGED
@@ -16,7 +16,7 @@ tags:

# InternVL3_5-38B-Instruct

-
[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[InternVL3\]](https://huggingface.co/papers/2504.10479) [\[InternVL3.5\]](
+
[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[InternVL3\]](https://huggingface.co/papers/2504.10479) [\[InternVL3.5\]](https://huggingface.co/papers/2508.18265)

[\[Blog\]](https://internvl.github.io/blog/) [\[Chat Demo\]](https://chat.intern-ai.org.cn/) [\[Quick Start\]](#quick-start) [\[Documents\]](https://internvl.readthedocs.io/en/latest/)
@@ -26,9 +26,7 @@ tags:

## Introduction

-
We introduce *InternVL3.5*, a new family of open-source multimodal
-
Benefiting from these innovations, InternVL3.5 achieves up to +18.3\% improvement in overall reasoning performance and 4.05 \\(\times\\) speedup in inference efficiency compared to its predecessor (i.e., InternVL3). In addition to these improvements, we have infused InternVL3.5 with a variety of new capabilities including GUI agent, embodied agent, etc.
-
Specifically, InternVL3.5-241B-A28B achieves the highest overall score on multimodal general, reasoning, text, and agency tasks among leading open source MLLMs, and narrows the gap with top commercial models such as GPT-5.
+
We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our *Decoupled Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

![image](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3.5/overall-performance.jpg)
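The introduction above describes Decoupled Vision-Language Deployment (DvD) only at a high level: the vision encoder and the language model are served on different GPUs so that visual encoding and language decoding do not compete for the same device. The sketch below illustrates that handoff with stand-in modules; the device names, shapes, and module choices are assumptions for illustration, not the InternVL3.5 deployment code.

```python
import torch
import torch.nn as nn

# Stand-in modules only; a real DvD setup would place the InternVL3.5 ViT and LLM here.
vision_device = "cuda:0" if torch.cuda.device_count() >= 1 else "cpu"
llm_device = "cuda:1" if torch.cuda.device_count() >= 2 else "cpu"

vision_encoder = nn.Linear(3 * 14 * 14, 4096).to(vision_device)  # placeholder vision tower
language_model = nn.Linear(4096, 32000).to(llm_device)           # placeholder language head

patches = torch.randn(256, 3 * 14 * 14, device=vision_device)    # flattened image patches
with torch.no_grad():
    visual_tokens = vision_encoder(patches)        # vision-side compute stays on its GPU
    visual_tokens = visual_tokens.to(llm_device)   # only the embeddings cross devices
    logits = language_model(visual_tokens)         # language-side compute on the other GPU
print(logits.shape)  # torch.Size([256, 32000])
```

The intended benefit is that only the visual embeddings cross devices, so each side can be batched and scaled for its own workload.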
@@ -191,7 +189,7 @@ $$

where the importance sampling ratio is defined as the geometric mean of the per-token ratios.

-
> Please see [our paper](
+
> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

### Visual Consistency Learning
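For a concrete handle on the sentence above: the geometric mean of the per-token importance ratios equals the exponential of the average per-token log-ratio, so it can be computed directly from per-token log-probabilities. Below is a minimal PyTorch sketch of that computation; the tensor names and masking convention are illustrative assumptions, not the released training code.

```python
import torch

def sequence_importance_ratio(new_logprobs: torch.Tensor,
                              old_logprobs: torch.Tensor,
                              mask: torch.Tensor) -> torch.Tensor:
    """Geometric mean over response tokens of pi_new(y_t) / pi_old(y_t).

    new_logprobs, old_logprobs: [batch, seq_len] log-probs of the sampled tokens
    under the current and behaviour policies; mask: 1.0 for response tokens, 0.0 otherwise.
    """
    log_ratio = (new_logprobs - old_logprobs) * mask                   # per-token log-ratios
    mean_log_ratio = log_ratio.sum(-1) / mask.sum(-1).clamp(min=1.0)   # mean over valid tokens
    return mean_log_ratio.exp()                                        # exp(mean log-ratio) = geometric mean

# Tiny usage example: batch of 2 responses, the second with one padded position.
new_lp, old_lp = torch.randn(2, 4), torch.randn(2, 4)
mask = torch.tensor([[1., 1., 1., 1.], [1., 1., 1., 0.]])
print(sequence_importance_ratio(new_lp, old_lp, mask))
```

In the full objective this sequence-level ratio would typically be clipped and weighted by an advantage estimate; the exact formulation is given in the paper linked above.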
@@ -241,7 +239,7 @@ $$

where \(y_i^{\text{router}}=0\) and \(y_i^{\text{router}}=1\) indicate that the compression rate \(\xi\) is set to \(\tfrac{1}{16}\) and \(\tfrac{1}{4}\), respectively.

-
> Please see [our paper](
+
> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

### Test-Time Scaling
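The routing rule above is simple enough to state as code. The helper below only illustrates how a binary router decision maps to the compression rate \(\xi\) and the resulting visual-token budget; the function name and token counts are hypothetical, not part of the InternVL3.5 API.

```python
def select_compression(y_router: int, num_patch_tokens: int) -> tuple[float, int]:
    """Map the ViR router's binary output to a compression rate xi and a token count.

    y_router == 0 -> xi = 1/16 (stronger compression); y_router == 1 -> xi = 1/4.
    Illustrative sketch, not the released implementation.
    """
    xi = 1 / 16 if y_router == 0 else 1 / 4
    return xi, max(1, round(num_patch_tokens * xi))

# Example: a tile represented by 1024 patch tokens.
print(select_compression(0, 1024))  # (0.0625, 64)  -> heavily compressed tile
print(select_compression(1, 1024))  # (0.25, 256)   -> lightly compressed tile
```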