MiniCPM-V-4_5 / README.md
orrzxz's picture
Update README.md
fcbedeb verified

A newer version of the Gradio SDK is available: 5.44.1

Upgrade
metadata
title: MiniCPM-V-4.5 Multimodal Chat
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: false
license: apache-2.0

MiniCPM-V-4.5 Multimodal Chat πŸš€

A powerful Gradio interface for the MiniCPM-V-4.5 multimodal model - a GPT-4V level MLLM with only 8B parameters!

Features

  • πŸ“Έ Image Understanding: Analyze single or multiple images with high-resolution support (up to 1.8M pixels)
  • πŸŽ₯ Video Understanding: Process videos with high refresh rate (up to 10 FPS) and efficient compression
  • πŸ“„ Document Parsing: Strong OCR capabilities and PDF document parsing
  • 🧠 Thinking Modes: Choose between fast thinking for efficiency or deep thinking for complex problems
  • 🌍 Multilingual: Support for 30+ languages
  • βš™οΈ Customizable: Adjust FPS, context size, temperature, and system prompts

Model Capabilities

MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks:

  • Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks
  • Leading OCR performance on OCRBench
  • Efficient video token compression (96x rate)
  • Trustworthy behaviors with multilingual support

Usage

  1. Upload: Choose an image or video file
  2. Configure: Adjust settings like FPS (for videos), context size, and temperature
  3. Prompt: Enter your question or use the system prompt for specific instructions
  4. Generate: Click the generate button to get the model's response

Examples

  • "What objects do you see in this image?"
  • "Describe the main action happening in this video"
  • "Read and transcribe any text visible in the image"
  • "Analyze this image from an artistic perspective"

Technical Details

  • Architecture: Built on Qwen3-8B and SigLIP2-400M
  • Parameters: 8B total parameters
  • Video Processing: 3D-Resampler with temporal understanding
  • Resolution: Supports images up to 1344x1344 pixels
  • Efficiency: 4x fewer visual tokens than most MLLMs

License

This model is released under the MiniCPM Model License. Free for academic research and commercial use after registration.

Citation

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nat Commun 16, 5509 (2025)},
  year={2025}
}