orrzxz committed on
Commit 4502601 · verified · 1 Parent(s): 159c520

Create README.md

Files changed (1)
  1. README.md +63 -6
README.md CHANGED
@@ -1,12 +1,69 @@
  ---
- title: MiniCPM-V-4 5
- emoji: 🐠
- colorFrom: purple
- colorTo: pink
  sdk: gradio
- sdk_version: 5.44.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

  ---
+ title: MiniCPM-V-4.5 Multimodal Chat
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
+ sdk_version: "4.0.0"
  app_file: app.py
  pinned: false
+ license: apache-2.0
  ---
 
+ # MiniCPM-V-4.5 Multimodal Chat 🚀
+
+ A Gradio interface for the MiniCPM-V-4.5 multimodal model, a GPT-4V-level MLLM with only 8B parameters.
+
+ ## Features
+
+ - 📸 **Image Understanding**: Analyze single or multiple images with high-resolution support (up to 1.8M pixels)
+ - 🎥 **Video Understanding**: Process videos at high frame rates (up to 10 FPS) with efficient token compression
+ - 📄 **Document Parsing**: Strong OCR capabilities and PDF document parsing
+ - 🧠 **Thinking Modes**: Choose between fast thinking for efficiency and deep thinking for complex problems
+ - 🌍 **Multilingual**: Support for 30+ languages
+ - ⚙️ **Customizable**: Adjust FPS, context size, temperature, and system prompts
+
+ ## Model Capabilities
+
+ MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks:
+ - Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks
+ - Leading OCR performance on OCRBench
+ - Efficient video token compression (96x rate)
+ - Trustworthy behaviors and multilingual support
+
+ ## Usage
+
+ 1. **Upload**: Choose an image or video file
+ 2. **Configure**: Adjust settings like FPS (for videos), context size, and temperature
+ 3. **Prompt**: Enter your question or use the system prompt for specific instructions
+ 4. **Generate**: Click the generate button to get the model's response (see the API sketch below for scripting the same flow)
+
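+ The same flow can also be driven from a script through the Space's API. The sketch below is only an illustration: the Space id, `api_name`, and argument order are assumptions, so check `client.view_api()` (or the Space's "Use via API" panel) for the real signature.
+
+ ```python
+ # Hypothetical client-side sketch -- the Space id, api_name, and argument order
+ # are assumptions; run client.view_api() to see the actual endpoint signatures.
+ from gradio_client import Client
+
+ client = Client("orrzxz/MiniCPM-V-4_5")    # assumed Space id
+ client.view_api()                          # prints the real endpoints and parameters
+
+ result = client.predict(
+     "example.jpg",                                # path to an image or video file
+     "What objects do you see in this image?",     # user prompt
+     api_name="/predict",                          # assumed endpoint name
+ )
+ print(result)
+ ```
+
+ Newer `gradio_client` releases may require wrapping file arguments in `gradio_client.handle_file(...)`; the generated API docs on the Space show the exact call.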
+ ## Examples
+
+ - "What objects do you see in this image?"
+ - "Describe the main action happening in this video"
+ - "Read and transcribe any text visible in the image"
+ - "Analyze this image from an artistic perspective"
+
+ ## Technical Details
+
+ - **Architecture**: Built on Qwen3-8B and SigLIP2-400M (see the loading sketch after this list)
+ - **Parameters**: 8B total
+ - **Video Processing**: 3D-Resampler with temporal understanding
+ - **Resolution**: Supports images up to 1344x1344 pixels
+ - **Efficiency**: 4x fewer visual tokens than most MLLMs
+
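+ For running the model outside the Space, the MiniCPM-V model cards document a `transformers` + `trust_remote_code` chat interface. The sketch below mirrors that pattern but is not taken from this Space's `app.py`; the repo id, dtype choice, and exact `chat()` signature are assumptions to verify against the official model card.
+
+ ```python
+ # Minimal local-inference sketch, assuming the chat() interface documented on the
+ # MiniCPM-V model cards; verify the repo id and exact signature before relying on it.
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ repo = "openbmb/MiniCPM-V-4_5"   # assumed Hub repo id for MiniCPM-V-4.5
+ model = AutoModel.from_pretrained(
+     repo, trust_remote_code=True, torch_dtype=torch.bfloat16
+ ).eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+
+ image = Image.open("example.jpg").convert("RGB")
+ msgs = [{"role": "user", "content": [image, "What objects do you see in this image?"]}]
+
+ # The fast/deep "thinking" toggle maps to an extra chat() argument on the model
+ # card (name not guaranteed here); this plain call uses the defaults.
+ answer = model.chat(msgs=msgs, tokenizer=tokenizer)
+ print(answer)
+ ```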
+ ## License
+
+ The MiniCPM-V-4.5 model weights are released under the MiniCPM Model License: free for academic research, and free for commercial use after registration. The Space itself is tagged Apache-2.0 in the metadata above.
+
+ ## Citation
+
+ ```bibtex
+ @article{yao2025minicpm,
+   title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
+   author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
+   journal={Nature Communications},
+   volume={16},
+   pages={5509},
+   year={2025}
+ }
+ ```