tc-mb committed on
Commit 09dd28e · verified · 1 Parent(s): 8483b3b

Update: README

Files changed (1)
  1. README.md +112 -6
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
14
  - custom_code
15
  ---
16
 
17
- <h1>A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone</h1>
18
 
19
  [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Demo](http://101.126.42.235:30910/)</a>
20
 
@@ -27,12 +27,12 @@ tags:
27
  - πŸ”₯ **State-of-the-art Vision-Language Capability.**
28
  MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
29
 
30
- - 🎬 **Efficient High Refresh Rate and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve 96x compression rate for video tokens, where 6 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means that the model can percieve significantly more video frames without increasing the LLM inference cost. This brings state-of-the-art high refresh rate (up to 10FPS) video understanding and long video understanding capabilities on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc., efficiently.
31
 
32
  - βš™οΈ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking for efficient frequent usage with competitive performance, and deep thinking for more complex problem solving. To cover efficiency and performance trade-offs in different user scenarios, this fast/deep thinking mode can be switched in a highly controlled fashion.
33
 
34
  - πŸ’ͺ **Strong OCR, Document Parsing and Others.**
35
- Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x less visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
36
 
37
  - πŸ’« **Easy Usage.**
38
  MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usages!
@@ -45,9 +45,9 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
45
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm-v-4dot5-framework.png" , width=100%>
46
  </div>
47
 
48
- - **Architechture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in MiniCPM-V series), MiniCPM-V 4.5 achieves a 96Γ— compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high refresh rate video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
49
 
50
- - **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
51
 
52
  - **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.
53
 
@@ -60,6 +60,83 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
60
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_result.png" , width=100%>
61
  </div>
62
 
63
  ### Examples
64
 
65
  <div align="center">
@@ -72,7 +149,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
72
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
73
  </div>
74
 
75
- We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without edition.
76
 
77
  <div align="center">
78
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
@@ -260,6 +337,35 @@ answer = model.chat(
260
  print(answer)
261
  ```
262
 
263
 
264
  ## License
265
  #### Model License
 
14
  - custom_code
15
  ---
16
 
17
+ <h1>A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone</h1>
18
 
19
  [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Demo](http://101.126.42.235:30910/)
20
 
 
27
  - πŸ”₯ **State-of-the-art Vision-Language Capability.**
28
  MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
29
 
30
+ - 🎬 **Efficient High-FPS and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve 96x compression rate for video tokens, where 6 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means that the model can perceive significantly more video frames without increasing the LLM inference cost. This brings state-of-the-art high-FPS (up to 10FPS) video understanding and long video understanding capabilities on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc., efficiently.
31
 
32
  - βš™οΈ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking for efficient frequent usage with competitive performance, and deep thinking for more complex problem solving. To cover efficiency and performance trade-offs in different user scenarios, this fast/deep thinking mode can be switched in a highly controlled fashion.
33
 
34
  - πŸ’ͺ **Strong OCR, Document Parsing and Others.**
35
+ Based on the [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art PDF document parsing performance on OmniDocBench among general MLLMs. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
36
 
37
  - πŸ’« **Easy Usage.**
38
  MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usages!
 
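The controllable fast/deep thinking described in the feature list above is switched at inference time through the same `model.chat` interface used in the usage examples later in this README. The snippet below is a minimal sketch only: the `enable_thinking` flag is an assumed parameter name, so please check the [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for the documented switch.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer as in the usage examples below.
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

image = Image.open('math_problem.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': [image, 'Solve the problem in the image step by step.']}]

# Fast thinking: the default, efficient mode for frequent everyday queries.
fast_answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)

# Deep thinking: assumed `enable_thinking` switch for more complex problems
# (hypothetical flag name; the documented interface may differ).
deep_answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer, enable_thinking=True)

print(fast_answer)
print(deep_answer)
```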
45
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpm-v-4dot5-framework.png" width="100%">
46
  </div>
47
 
48
+ - **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens (a short token-count sketch follows this list). This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
49
 
50
+ - **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead. (A toy sketch of this corruption step also follows this list.)
51
 
52
  - **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.
53
 
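To make the 96× compression figure in the architecture bullet above concrete, here is a small back-of-the-envelope token count. The 14x14 patch size (giving 1024 visual patches per 448x448 frame) is an assumption based on the SigLIP-style encoder used by the MiniCPM-V series; it is not a number stated in this README.

```python
# Illustrative token accounting for the unified 3D-Resampler.
FRAME_SIDE = 448          # frame resolution quoted above
PATCH_SIZE = 14           # assumed ViT patch size (SigLIP-style encoder)
FRAMES_PER_GROUP = 6      # consecutive frames jointly compressed
TOKENS_PER_GROUP = 64     # resampler output tokens per group

patches_per_frame = (FRAME_SIDE // PATCH_SIZE) ** 2   # 32 * 32 = 1024 visual patches
raw_patches = FRAMES_PER_GROUP * patches_per_frame    # 6144 patches for 6 frames
compression = raw_patches / TOKENS_PER_GROUP          # 6144 / 64 = 96x

# A typical MLLM spending ~256 tokens per 448x448 frame would instead need:
typical_tokens = FRAMES_PER_GROUP * 256               # 1536 tokens, as quoted above

print(f"{compression:.0f}x compression: {TOKENS_PER_GROUP} tokens vs {typical_tokens} tokens")
```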
 
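The pre-training bullet above relies on corrupting text regions at varying noise levels so that the model learns when to read text and when to reason from context. The function below is a toy illustration of that idea only; the bounding box, noise schedule, and blending rule are all hypothetical and do not describe the actual data pipeline.

```python
import numpy as np
from PIL import Image

def corrupt_text_region(img: Image.Image, box, noise_level: float) -> Image.Image:
    """Blend Gaussian noise into one text region; noise_level in [0, 1].
    Low levels keep the text readable (OCR-style supervision); high levels
    force the model to infer the text from multimodal context."""
    arr = np.array(img, dtype=np.float32)
    x0, y0, x1, y1 = box
    region = arr[y0:y1, x0:x1]
    noise = np.random.normal(loc=127.5, scale=60.0, size=region.shape)
    arr[y0:y1, x0:x1] = (1.0 - noise_level) * region + noise_level * noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Hypothetical usage: sample a corruption strength per training example.
page = Image.open('doc_page.png').convert('RGB')
level = float(np.random.uniform(0.0, 1.0))
corrupted = corrupt_text_region(page, box=(100, 200, 600, 260), noise_level=level)
corrupted.save('doc_page_corrupted.png')
```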
60
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_result.png" width="100%">
61
  </div>
62
 
63
+ ### Inference Efficiency
64
+
65
+ **OpenCompass**
66
+ <div align="left">
67
+ <table style="margin: 0px auto;">
68
+ <thead>
69
+ <tr>
70
+ <th align="left">Model</th>
71
+ <th>Size</th>
72
+ <th>Avg Score ↑</th>
73
+ <th>Total Inference Time ↓</th>
74
+ </tr>
75
+ </thead>
76
+ <tbody align="center">
77
+ <tr>
78
+ <td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
79
+ <td>10.3B</td>
80
+ <td>76.6</td>
81
+ <td>17.5h</td>
82
+ </tr>
83
+ <tr>
84
+ <td nowrap="nowrap" align="left">MiMo-VL-7B-RL</td>
85
+ <td>8.3B</td>
86
+ <td>76.4</td>
87
+ <td>11h</td>
88
+ </tr>
89
+ <tr>
90
+ <td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
91
+ <td>8.7B</td>
92
+ <td><b>77.0</b></td>
93
+ <td><b>7.5h</b></td>
94
+ </tr>
95
+ </tbody>
96
+ </table>
97
+ </div>
98
+
99
+ **Video-MME**
100
+
101
+ <div align="left">
102
+ <table style="margin: 0px auto;">
103
+ <thead>
104
+ <tr>
105
+ <th align="left">Model</th>
106
+ <th>Size</th>
107
+ <th>Avg Score ↑</th>
108
+ <th>Total Inference Time ↓</th>
109
+ <th>GPU Mem ↓</th>
110
+ </tr>
111
+ </thead>
112
+ <tbody align="center">
113
+ <tr>
114
+ <td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
115
+ <td>8.3B</td>
116
+ <td>71.6</td>
117
+ <td>3h</td>
118
+ <td>60G</td>
119
+ </tr>
120
+ <tr>
121
+ <td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
122
+ <td>10.3B</td>
123
+ <td><b>73.6</b></td>
124
+ <td>2.63h</td>
125
+ <td>32G</td>
126
+ </tr>
127
+ <tr>
128
+ <td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
129
+ <td>8.7B</td>
130
+ <td>73.5</td>
131
+ <td><b>0.26h</b></td>
132
+ <td><b>28G</b></td>
133
+ </tr>
134
+ </tbody>
135
+ </table>
136
+ </div>
137
+
138
+ Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME excludes the cost of video frame extraction.
139
+
140
  ### Examples
141
 
142
  <div align="center">
 
149
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
150
  </div>
151
 
152
+ We deploy MiniCPM-V 4.5 on an iPad M4 with the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is a raw screen recording without editing.
153
 
154
  <div align="center">
155
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
 
337
  print(answer)
338
  ```
339
 
340
+ #### Chat with multiple images
341
+ <details>
342
+ <summary> Click to show Python code running MiniCPM-V 4.5 with multiple image inputs. </summary>
343
+
344
+ ```python
345
+ import torch
346
+ from PIL import Image
347
+ from transformers import AutoModel, AutoTokenizer
348
+
349
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
350
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
351
+ model = model.eval().cuda()
352
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)
353
+
354
+ image1 = Image.open('image1.jpg').convert('RGB')
355
+ image2 = Image.open('image2.jpg').convert('RGB')
356
+ question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
357
+
358
+ msgs = [{'role': 'user', 'content': [image1, image2, question]}]
359
+
360
+ answer = model.chat(
361
+ image=None,
362
+ msgs=msgs,
363
+ tokenizer=tokenizer
364
+ )
365
+ print(answer)
366
+ ```
367
+ </details>
368
+
369
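#### Chat with video (illustrative sketch)

The high-FPS and long video understanding highlighted above is driven by the same `model.chat` interface. The sketch below samples frames at a fixed rate and passes them in `msgs` like the multiple-image example; the `sample_frames` helper, the chosen FPS, and the omission of any video-specific packing options are assumptions for illustration only. See the [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for the official video inference example.

<details>
<summary> Click to show a sketch of running MiniCPM-V 4.5 on sampled video frames. </summary>

```python
import torch
from PIL import Image
from decord import VideoReader, cpu   # assumed dependency for frame extraction
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

def sample_frames(video_path, fps=5, max_frames=180):
    """Uniformly sample frames at roughly `fps` frames per second (illustrative helper)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    step = max(1, round(vr.get_avg_fps() / fps))
    indices = list(range(0, len(vr), step))[:max_frames]
    return [Image.fromarray(frame) for frame in vr.get_batch(indices).asnumpy()]

frames = sample_frames('video.mp4', fps=5)
question = 'Describe what happens in this video.'
msgs = [{'role': 'user', 'content': frames + [question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>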
 
370
  ## License
371
  #### Model License