Improve model card: Add descriptive tags
This PR enhances the model card by adding more descriptive tags to improve discoverability on the Hugging Face Hub. The new tags `multimodal-llm`, `vision-language-model`, and `agent` better reflect the model's capabilities as described in the paper and GitHub README, which highlight its multimodal nature, reasoning abilities, and support for agentic tasks.
README.md
CHANGED
@@ -1,19 +1,22 @@
---
base_model:
+ - OpenGVLab/InternViT-300M-448px-V2_5
+ - Qwen/Qwen3-14B
datasets:
+ - OpenGVLab/MMPR-v1.2
+ - OpenGVLab/MMPR-Tiny
language:
+ - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
tags:
+ - internvl
+ - custom_code
+ - multimodal-llm
+ - vision-language-model
+ - agent
+ base_model_relation: merge
---

# InternVL3_5-14B-Pretrained
@@ -28,7 +31,7 @@

## Introduction

+ We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

![performance](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3_5/overall_performance.jpg)

@@ -142,7 +145,7 @@
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
+ Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.

![architecture](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3_5/architecture_flash.jpg)
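To make the patch-aware compression concrete, here is a minimal sketch of the idea. It is not the InternVL implementation: `pixel_shuffle_compress`, `PatchRouter`, and the feature width of 1024 are illustrative assumptions, and the real router is a trained module rather than a random linear probe. The sketch treats a patch's 1024 visual tokens as a 32 × 32 grid and folds 2 × 2 or 4 × 4 neighborhoods into the channel dimension, yielding 256 or 64 tokens.

```python
import torch
import torch.nn as nn

def pixel_shuffle_compress(tokens: torch.Tensor, ratio: int) -> torch.Tensor:
    """Space-to-depth compression of visual tokens (illustrative, not the InternVL module).

    tokens: [batch, grid*grid, dim] tokens from the vision encoder (grid=32 -> 1024 tokens).
    ratio:  side-length reduction factor (2 -> 256 tokens, 4 -> 64 tokens).
    """
    b, n, d = tokens.shape
    g = int(n ** 0.5)                                    # 32 for 1024 tokens
    x = tokens.view(b, g, g, d)
    # Fold each (ratio x ratio) neighborhood of tokens into the channel dimension.
    x = x.view(b, g // ratio, ratio, g // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (g // ratio) ** 2, ratio * ratio * d)
    return x

class PatchRouter(nn.Module):
    """Toy stand-in for the ViR patch router: scores a patch's 'semantic richness'."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 2)    # logits for {keep 256 tokens, compress to 64}

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        pooled = patch_tokens.mean(dim=1)            # pool the patch's tokens
        return self.score(pooled).argmax(dim=-1)     # 0 -> 256 tokens, 1 -> 64 tokens

# One image patch represented by 1024 vision-encoder tokens of width 1024 (assumed).
tokens = torch.randn(1, 1024, 1024)
route = PatchRouter(dim=1024)(tokens)
compressed = pixel_shuffle_compress(tokens, ratio=2 if route.item() == 0 else 4)
print(compressed.shape)   # torch.Size([1, 256, 4096]) or torch.Size([1, 64, 16384])
```

Note that compressing tokens this way does not discard values: each ratio × ratio neighborhood is repacked into a single, wider token, which is why the more aggressive rate yields 64 tokens of dimension 16 × 1024 in this sketch.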
@@ -234,7 +237,7 @@
\Bigg],
$$

+ where \\(\mathrm{KL}\\) denotes the KL divergence and \\(\xi\\) denotes the compression rate, which is uniformly sampled from \\(\{\frac{1}{4},\frac{1}{16}\}\\). The image \\(I_\xi\\) is represented as 256 tokens when \\(\xi=\frac{1}{4}\\) and 64 tokens when \\(\xi=\frac{1}{16}\\). Notably, the reference model always performs inference with \\(\xi=\frac{1}{4}\\).
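As a rough illustration of this consistency objective, the sketch below computes a token-level KL between the model run at a sampled compression rate and a frozen reference run at \\(\xi=\frac{1}{4}\\). It is not the released training code: `policy`, `reference`, the `xi` keyword, and `labels_mask` are hypothetical stand-ins, and the direction of the KL term is an assumption.

```python
import random

import torch
import torch.nn.functional as F

def consistency_loss(policy, reference, text_ids, image, labels_mask):
    """Sketch of the ViR consistency objective (stand-in interfaces, not InternVL code).

    `policy` and `reference` are assumed to return next-token logits given text ids
    plus the visual tokens of `image` compressed at rate `xi`; `labels_mask` marks
    the response tokens over which the loss is averaged.
    """
    xi = random.choice([1 / 4, 1 / 16])      # 256 or 64 visual tokens per image patch
    logits_policy = policy(text_ids, image, xi=xi)
    with torch.no_grad():                    # the reference always sees the 1/4 rate
        logits_ref = reference(text_ids, image, xi=1 / 4)
    # Token-level KL(reference || policy): target probs come from the frozen
    # reference, log-probs from the trainable model.
    kl = F.kl_div(
        F.log_softmax(logits_policy, dim=-1),
        F.softmax(logits_ref, dim=-1),
        reduction='none',
    ).sum(-1)
    return (kl * labels_mask).sum() / labels_mask.sum()
```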

`Router training`:
@@ -530,40 +533,50 @@
# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -571,17 +584,21 @@
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -589,13 +606,15 @@
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -633,17 +652,24 @@
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```

#### Streaming Output

@@ -727,7 +753,9 @@

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```

@@ -830,3 +858,14 @@
  year={2025}
}
```

+ ## Acknowledgement

+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!

+ ______________________________________________________________________

+ Scan the following QR code to join our WeChat group.

+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>