Spaces:
Running
Running
title: MiniCPM-V-4.5 Multimodal Chat | |
emoji: π | |
colorFrom: blue | |
colorTo: purple | |
sdk: gradio | |
sdk_version: 5.44.0 | |
app_file: app.py | |
pinned: false | |
license: apache-2.0 | |
# MiniCPM-V-4.5 Multimodal Chat π | |
A powerful Gradio interface for the MiniCPM-V-4.5 multimodal model - a GPT-4V level MLLM with only 8B parameters! | |
## Features | |
- πΈ **Image Understanding**: Analyze single or multiple images with high-resolution support (up to 1.8M pixels) | |
- π₯ **Video Understanding**: Process videos with high refresh rate (up to 10 FPS) and efficient compression | |
- π **Document Parsing**: Strong OCR capabilities and PDF document parsing | |
- π§ **Thinking Modes**: Choose between fast thinking for efficiency or deep thinking for complex problems | |
- π **Multilingual**: Support for 30+ languages | |
- βοΈ **Customizable**: Adjust FPS, context size, temperature, and system prompts | |
## Model Capabilities | |
MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks: | |
- Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks | |
- Leading OCR performance on OCRBench | |
- Efficient video token compression (96x rate) | |
- Trustworthy behaviors with multilingual support | |
## Usage | |
1. **Upload**: Choose an image or video file | |
2. **Configure**: Adjust settings like FPS (for videos), context size, and temperature | |
3. **Prompt**: Enter your question or use the system prompt for specific instructions | |
4. **Generate**: Click the generate button to get the model's response | |
## Examples | |
- "What objects do you see in this image?" | |
- "Describe the main action happening in this video" | |
- "Read and transcribe any text visible in the image" | |
- "Analyze this image from an artistic perspective" | |
## Technical Details | |
- **Architecture**: Built on Qwen3-8B and SigLIP2-400M | |
- **Parameters**: 8B total parameters | |
- **Video Processing**: 3D-Resampler with temporal understanding | |
- **Resolution**: Supports images up to 1344x1344 pixels | |
- **Efficiency**: 4x fewer visual tokens than most MLLMs | |
## License | |
This model is released under the MiniCPM Model License. Free for academic research and commercial use after registration. | |
## Citation | |
```bibtex | |
@article{yao2024minicpm, | |
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, | |
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, | |
journal={Nat Commun 16, 5509 (2025)}, | |
year={2025} | |
} | |
``` |