orrzxz committed on
Commit 4502601 · verified · 1 Parent(s): 159c520

Create README.md

Files changed (1)
  1. README.md +63 -6
README.md CHANGED
@@ -1,12 +1,69 @@
  ---
- title: MiniCPM-V-4 5
- emoji: 🐠
- colorFrom: purple
- colorTo: pink
  sdk: gradio
- sdk_version: 5.44.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

  ---
+ title: MiniCPM-V-4.5 Multimodal Chat
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
+ sdk_version: "4.0.0"
  app_file: app.py
  pinned: false
+ license: apache-2.0
  ---
 
+ # MiniCPM-V-4.5 Multimodal Chat 🚀
+
+ A Gradio interface for the MiniCPM-V-4.5 multimodal model, a GPT-4V-level MLLM with only 8B parameters.
+
+ ## Features
+
+ - 📸 **Image Understanding**: Analyze single or multiple images with high-resolution support (up to 1.8M pixels)
+ - 🎥 **Video Understanding**: Process videos at high frame rates (up to 10 FPS) with efficient token compression
+ - 📄 **Document Parsing**: Strong OCR capabilities and PDF document parsing
+ - 🧠 **Thinking Modes**: Choose between fast thinking for efficiency and deep thinking for complex problems
+ - 🌍 **Multilingual**: Support for 30+ languages
+ - ⚙️ **Customizable**: Adjust FPS, context size, temperature, and system prompts
+
+ ## Model Capabilities
+
+ MiniCPM-V-4.5 achieves state-of-the-art performance across multiple benchmarks:
+ - Surpasses GPT-4o-latest and Gemini-2.0 Pro on vision-language tasks
+ - Leading OCR performance on OCRBench
+ - Efficient video token compression (96x rate)
+ - Trustworthy behaviors and multilingual support
+
+ ## Usage
+
+ 1. **Upload**: Choose an image or video file
+ 2. **Configure**: Adjust settings like FPS (for videos), context size, and temperature
+ 3. **Prompt**: Enter your question or use the system prompt for specific instructions
+ 4. **Generate**: Click the generate button to get the model's response (see the API sketch below for scripting the same flow)
+
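+ The same flow can also be driven from a script through the Space's API. The sketch below is only an illustration: the Space id, `api_name`, and argument order are assumptions, so check `client.view_api()` (or the Space's "Use via API" panel) for the real signature.
+
+ ```python
+ # Hypothetical client-side sketch -- the Space id, api_name, and argument order
+ # are assumptions; run client.view_api() to see the actual endpoint signatures.
+ from gradio_client import Client
+
+ client = Client("orrzxz/MiniCPM-V-4_5")    # assumed Space id
+ client.view_api()                          # prints the real endpoints and parameters
+
+ result = client.predict(
+     "example.jpg",                                # path to an image or video file
+     "What objects do you see in this image?",     # user prompt
+     api_name="/predict",                          # assumed endpoint name
+ )
+ print(result)
+ ```
+
+ Newer `gradio_client` releases may require wrapping file arguments in `gradio_client.handle_file(...)`; the generated API docs on the Space show the exact call.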
+ ## Examples
+
+ - "What objects do you see in this image?"
+ - "Describe the main action happening in this video"
+ - "Read and transcribe any text visible in the image"
+ - "Analyze this image from an artistic perspective"
+
+ ## Technical Details
+
+ - **Architecture**: Built on Qwen3-8B and SigLIP2-400M (see the loading sketch after this list)
+ - **Parameters**: 8B total
+ - **Video Processing**: 3D-Resampler with temporal understanding
+ - **Resolution**: Supports images up to 1344x1344 pixels
+ - **Efficiency**: 4x fewer visual tokens than most MLLMs
+
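+ For running the model outside the Space, the MiniCPM-V model cards document a `transformers` + `trust_remote_code` chat interface. The sketch below mirrors that pattern but is not taken from this Space's `app.py`; the repo id, dtype choice, and exact `chat()` signature are assumptions to verify against the official model card.
+
+ ```python
+ # Minimal local-inference sketch, assuming the chat() interface documented on the
+ # MiniCPM-V model cards; verify the repo id and exact signature before relying on it.
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ repo = "openbmb/MiniCPM-V-4_5"   # assumed Hub repo id for MiniCPM-V-4.5
+ model = AutoModel.from_pretrained(
+     repo, trust_remote_code=True, torch_dtype=torch.bfloat16
+ ).eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+
+ image = Image.open("example.jpg").convert("RGB")
+ msgs = [{"role": "user", "content": [image, "What objects do you see in this image?"]}]
+
+ # The fast/deep "thinking" toggle maps to an extra chat() argument on the model
+ # card (name not guaranteed here); this plain call uses the defaults.
+ answer = model.chat(msgs=msgs, tokenizer=tokenizer)
+ print(answer)
+ ```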
+ ## License
+
+ The MiniCPM-V-4.5 model weights are released under the MiniCPM Model License: free for academic research, and free for commercial use after registration. The Space itself is tagged Apache-2.0 in the metadata above.
+
+ ## Citation
+
+ ```bibtex
+ @article{yao2025minicpm,
+   title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
+   author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
+   journal={Nature Communications},
+   volume={16},
+   pages={5509},
+   year={2025}
+ }
+ ```