---
title: MiniCPM-V-4.5 Multimodal Chat
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: false
license: apache-2.0
---

# MiniCPM-V-4.5 Multimodal Chat 🚀

A Gradio interface for MiniCPM-V-4.5, a GPT-4V-level multimodal large language model (MLLM) with only 8B parameters.

## Features

- 📸 **Image Understanding**: Analyze single or multiple images with high-resolution support (up to 1.8M pixels)
- 🎥 **Video Understanding**: Process videos with high refresh rate (up to 10 FPS) and efficient compression
- 📄 **Document Parsing**: Strong OCR capabilities and PDF document parsing
- 🧠 **Thinking Modes**: Choose between fast thinking for efficiency or deep thinking for complex problems
- 🌍 **Multilingual**: Support for 30+ languages
- ⚙️ **Customizable**: Adjust FPS, context size, temperature, and system prompts

## Model Capabilities

MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks:
- Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks
- Leading OCR performance on OCRBench
- Efficient video token compression (96x rate)
- Trustworthy behaviors with multilingual support
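
The 96x video compression figure can be read as a ratio of raw image patches to the visual tokens the model actually consumes. The sketch below works through that arithmetic. The default values (groups of 6 frames at 448x448, 14-pixel patches, 64 output tokens per group) are illustrative assumptions that happen to reproduce the 96x figure; they are not stated in this README, and the function name is hypothetical.

```python
def visual_token_compression(frames: int = 6, frame_side: int = 448,
                             patch_size: int = 14, output_tokens: int = 64) -> float:
    """Ratio of raw image patches to the visual tokens kept after compression.

    All defaults are assumed, illustrative values, not taken from this README.
    """
    patches_per_frame = (frame_side // patch_size) ** 2  # 32 * 32 = 1024
    total_patches = frames * patches_per_frame           # 6 * 1024 = 6144
    return total_patches / output_tokens                 # 6144 / 64


print(visual_token_compression())  # 96.0
```

Under these assumptions, a higher FPS setting adds frames without a proportional increase in tokens, because each group of frames is compressed jointly.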

## Usage

1. **Upload**: Choose an image or video file
2. **Configure**: Adjust settings like FPS (for videos), context size, and temperature
3. **Prompt**: Enter your question or use the system prompt for specific instructions
4. **Generate**: Click the generate button to get the model's response
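
The FPS setting in step 2 controls how densely a video is sampled before it reaches the model. A minimal sketch of what such sampling could look like, assuming uniform sampling and a cap at the 10 FPS limit listed under Features; the helper `sample_frame_indices` is hypothetical and is not taken from `app.py`:

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float, max_fps: float = 10.0) -> list[int]:
    """Return indices of frames to keep so that roughly `target_fps`
    frames per second of video survive (hypothetical helper)."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample faster than the cap or than the video's own frame rate.
    fps = min(target_fps, max_fps, native_fps)
    step = native_fps / fps  # always >= 1, so indices never repeat
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# A 3-second clip recorded at 30 FPS, sampled at 3 FPS, keeps 9 frames.
print(sample_frame_indices(90, 30.0, 3.0))
```

Requests above the cap are clamped rather than rejected, which keeps a slider-style control forgiving; a real app might prefer to validate the value instead.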

## Examples

- "What objects do you see in this image?"
- "Describe the main action happening in this video"  
- "Read and transcribe any text visible in the image"
- "Analyze this image from an artistic perspective"

## Technical Details

- **Architecture**: Built on Qwen3-8B and SigLIP2-400M
- **Parameters**: 8B total parameters
- **Video Processing**: 3D-Resampler with temporal understanding
- **Resolution**: Supports images up to 1344x1344 pixels
- **Efficiency**: 4x fewer visual tokens than most MLLMs
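
The 1344x1344 limit corresponds to 1,806,336 pixels, which matches the "up to 1.8M pixels" figure under Features. A minimal sketch of how an oversized image could be downscaled to fit, assuming the limit is a total pixel budget that applies at any aspect ratio (an assumption, as is the helper name `fit_within_pixel_budget`):

```python
import math

MAX_PIXELS = 1344 * 1344  # 1,806,336 pixels, roughly 1.8M


def fit_within_pixel_budget(width: int, height: int,
                            max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Scale (width, height) down, preserving aspect ratio, so the total
    pixel count does not exceed `max_pixels` (hypothetical helper)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Images already inside the budget are left untouched (scale == 1.0).
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    return max(1, int(width * scale)), max(1, int(height * scale))


print(fit_within_pixel_budget(800, 600))    # unchanged: already under budget
print(fit_within_pixel_budget(4000, 3000))  # scaled down to fit the budget
```

The resulting size can be passed to any image library's resize call before upload.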

## License

This model is released under the MiniCPM Model License. It is free for academic research, and free for commercial use after registration.

## Citation

```bibtex
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}
```