---
title: MiniCPM-V-4.5 Multimodal Chat
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "4.0.0"
app_file: app.py
pinned: false
license: apache-2.0
---

# MiniCPM-V-4.5 Multimodal Chat

A Gradio interface for the MiniCPM-V-4.5 multimodal model, a GPT-4V-level MLLM with only 8B parameters.

## Features

- **Image Understanding**: Analyze single or multiple images with high-resolution support (up to 1.8M pixels)
- **Video Understanding**: Process videos at a high refresh rate (up to 10 FPS) with efficient token compression
- **Document Parsing**: Strong OCR capabilities and PDF document parsing
- **Thinking Modes**: Choose between fast thinking for efficiency and deep thinking for complex problems
- **Multilingual**: Support for 30+ languages
- **Customizable**: Adjust FPS, context size, temperature, and the system prompt

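The FPS setting for videos comes down to choosing which source frames are fed to the model. A minimal sketch of that sampling step (the helper below is illustrative only, not part of this Space's actual code):

```python
def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Pick roughly evenly spaced frame indices so a clip recorded at
    `source_fps` is sampled at about `target_fps` (never upsampled)."""
    step = max(source_fps / target_fps, 1.0)  # clamp: don't duplicate frames
    indices, pos = [], 0.0
    while int(pos) < total_frames:
        indices.append(int(pos))
        pos += step
    return indices

# A 10-second clip at 30 FPS, sampled at the Space's 10 FPS ceiling:
# frames 0, 3, 6, ..., 297 -> 100 frames fed to the model.
```

Lowering the FPS slider trades temporal detail for fewer visual tokens per video.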
## Model Capabilities

MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks:

- Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks
- Leading OCR performance on OCRBench
- Efficient video token compression (96x rate)
- Trustworthy behaviors with multilingual support

## Usage

1. **Upload**: Choose an image or video file
2. **Configure**: Adjust settings such as FPS (for videos), context size, and temperature
3. **Prompt**: Enter your question, or set the system prompt for specific instructions
4. **Generate**: Click the generate button to get the model's response

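Under the hood, `app.py` presumably drives the model through the Hugging Face `transformers` remote-code API. A sketch of that call pattern, assuming the `openbmb/MiniCPM-V-4_5` checkpoint exposes the same `model.chat` interface as earlier MiniCPM-V releases (the message-building helper is the runnable part; the commented lines need a GPU and the model weights):

```python
def build_msgs(image, question):
    # Message format used by MiniCPM-V's `model.chat` API: a list of
    # role/content turns, where content mixes images and strings.
    return [{"role": "user", "content": [image, question]}]

# Actual inference (downloads ~8B weights; checkpoint name and `chat`
# signature are assumptions based on earlier MiniCPM-V model cards):
# from transformers import AutoModel, AutoTokenizer
# model = AutoModel.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4_5", trust_remote_code=True)
# answer = model.chat(msgs=build_msgs(img, "What objects do you see?"), tokenizer=tokenizer)
```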
## Examples

- "What objects do you see in this image?"
- "Describe the main action happening in this video"
- "Read and transcribe any text visible in the image"
- "Analyze this image from an artistic perspective"

## Technical Details

- **Architecture**: Built on Qwen3-8B and SigLIP2-400M
- **Parameters**: 8B total
- **Video Processing**: 3D-Resampler with temporal understanding
- **Resolution**: Supports images up to 1344x1344 pixels
- **Efficiency**: Uses 4x fewer visual tokens than most MLLMs

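As a quick sanity check, the two resolution figures quoted in this README agree with each other:

```python
# 1344 x 1344 works out to ~1.8M pixels, matching the Features section.
pixels = 1344 * 1344
print(pixels)                  # 1806336
print(f"{pixels / 1e6:.1f}M")  # 1.8M
```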
## License

This model is released under the MiniCPM Model License: free for academic research, and free for commercial use after registration.

## Citation

```bibtex
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}
```