JamesHujy committed on
Commit 4d553c2 · 1 Parent(s): 8b40eaf

Update README.md

Files changed (1)
  1. README.md +84 -13
README.md CHANGED
@@ -1,26 +1,97 @@
  ---
- license: apache-2.0
  language:
- - zh
  - en
  ---

- # VisCPM
-
- [GITHUB](https://github.com/OpenBMB/VisCPM)

- `VisCPM` is a family of open-source large multimodal models that support multimodal conversation (the `VisCPM-Chat` model) and text-to-image generation (the `VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. `VisCPM` is built on the 10B-parameter large language model [CPM-Bee](https://huggingface.co/openbmb/cpm-bee-10b), fused with a visual encoder (`Q-Former`) and a visual decoder (`Diffusion-UNet`) to support visual inputs and outputs. Thanks to the strong bilingual capability of `CPM-Bee`, `VisCPM` can be pre-trained on English multimodal data only and still generalize well to achieve promising Chinese multimodal capabilities.

- ## VisCPM-Chat
- `VisCPM-Chat` supports bilingual multimodal conversations about images in both Chinese and English. The model uses `Q-Former` as the visual encoder and CPM-Bee (10B) as the base LLM, and combines the visual and language models through a language-modeling training objective. Training consists of two stages: pretraining and instruction fine-tuning.

- * Pretraining: `VisCPM-Chat` was pretrained on approximately 100 million high-quality English image-text pairs drawn from CC3M, CC12M, COCO, Visual Genome, LAION, and other sources. In this stage the language model parameters remain frozen and only the `Q-Former` parameters are updated, enabling efficient alignment of large-scale vision-language representations (see the sketch after this list).

- * Instruction fine-tuning: We used the [LLaVA-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) dataset, an English multimodal instruction-following dataset, mixed with its Chinese translation, to fine-tune the model and align its multimodal capabilities with user intents. In this phase all model parameters are updated to make better use of the instruction-tuning data. Interestingly, even when fine-tuned on English instruction data alone, the model can understand Chinese questions but responds only in English, which indicates good generalization of its multilingual and multimodal capabilities. Adding a small amount of translated Chinese data during instruction fine-tuning aligns the model's response language with the language of the user's question.

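As a rough illustration of the pretraining recipe above (base LLM frozen, only the `Q-Former` trained), here is a minimal PyTorch-style sketch. The attribute names `language_model` and `q_former` are hypothetical placeholders, not VisCPM's actual module names, and the optimizer settings are illustrative only.

```python
import torch

def configure_pretraining(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Stage-1 setup: freeze the LLM, keep only the Q-Former trainable."""
    # Freeze every parameter of the base language model (CPM-Bee in VisCPM).
    for param in model.language_model.parameters():  # hypothetical attribute name
        param.requires_grad = False

    # The Q-Former bridge between the vision encoder and the LLM stays trainable.
    trainable_params = list(model.q_former.parameters())  # hypothetical attribute name

    # Only the Q-Former parameters are passed to the optimizer.
    return torch.optim.AdamW(trainable_params, lr=1e-4, weight_decay=0.05)  # illustrative hyperparameters
```
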
- We evaluated the model on the LLaVA English test set and its translated Chinese test set. The benchmark examines performance in open-domain conversation, image detail description, and complex reasoning, using GPT-4 for scoring. `VisCPM-Chat` achieved the best average performance in Chinese multimodal capabilities, excelling in general-domain conversation and complex reasoning, while also demonstrating solid English multimodal abilities.

  ## VisCPM-Paint
- `VisCPM-Paint` supports bilingual text-to-image generation. The model uses CPM-Bee (10B) as the text encoder and `UNet` as the image decoder, and fuses the language and visual models with a diffusion training objective. During training, the language model parameters remain fixed. The visual decoder is initialized from the parameters of [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) and fused with the language model by gradually unfreezing key bridging parameters: first a linear layer that maps text representations into the visual model is trained, and then the cross-attention layers of `UNet` are further unfrozen. The model was trained on the [LAION 2B](https://huggingface.co/datasets/laion/laion2B-en) English text-image pair dataset.
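
The gradual-unfreezing scheme described above can be sketched with the Hugging Face `diffusers` implementation of the Stable Diffusion 2.1 UNet: freeze everything, train a linear mapping from the text encoder into the UNet, then additionally unfreeze the cross-attention (`attn2`) layers. This is an illustration of the technique under stated assumptions (the 4096-dimensional text hidden size and the learning rate are placeholders), not the authors' training code.

```python
import torch
from diffusers import UNet2DConditionModel

# Visual decoder initialized from the Stable Diffusion 2.1 UNet.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)
unet.requires_grad_(False)  # start with the whole UNet frozen

# Stage 1: train only a linear layer that maps LLM hidden states into the
# UNet's cross-attention dimension (4096 is an assumed text hidden size).
text_to_unet_proj = torch.nn.Linear(4096, unet.config.cross_attention_dim)

# Stage 2: additionally unfreeze the UNet cross-attention layers ("attn2" in diffusers).
for name, param in unet.named_parameters():
    if "attn2" in name:
        param.requires_grad = True

trainable = list(text_to_unet_proj.parameters()) + [
    p for p in unet.parameters() if p.requires_grad
]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # illustrative learning rate
```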
- Similar to `VisCPM-Chat`, we found that, thanks to the bilingual ability of CPM-Bee, `VisCPM-Paint` can be trained on English image-text pairs only and still generalize to strong Chinese text-to-image generation, achieving the best results among Chinese open-source models. Adding 20M cleaned native Chinese image-text pairs and 120M image-text pairs translated into Chinese further improves the model's Chinese text-to-image generation capability.
 
  ---
  language:
  - en
+ - zh
  ---
+ <div align="center">

+ **VisCPM**

+ **Chinese-English bilingual multimodal large model series based on the CPM (Chinese Pretrained Models) base model**

+ <p align="center">
+ <a href="https://github.com/OpenBMB/VisCPM">Github</a>
+ <a href="https://huggingface.co/openbmb/VisCPM-Chat">VisCPM-Chat</a>
+ </p>

+ </div>

+ `VisCPM` is a family of open-source large multimodal models that support multimodal conversation (the `VisCPM-Chat` model) and text-to-image generation (the `VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is built on the 10B-parameter large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee), fused with a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, `VisCPM` can be pre-trained on English multimodal data only and still generalize well to achieve promising Chinese multimodal capabilities.

+ - **👐 Open-source Usage**: VisCPM is free to use for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community around large multimodal models and related research.
+ - **🌟 Image and text generation coverage**: VisCPM models provide relatively comprehensive support for multimodal capabilities, covering both multimodal conversation (image-to-text generation) and text-to-image generation.
+ - **💫 Excellent bilingual performance**: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.
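
To try the multimodal conversation capability, the VisCPM GitHub repository provides a Python wrapper around the chat model. The snippet below is a minimal sketch assuming a `VisCPMChat` class with a `chat(image, question)` method and a local checkpoint path; verify the exact class names, arguments, and return values against the repository before use.

```python
from PIL import Image
from VisCPM import VisCPMChat  # wrapper from https://github.com/OpenBMB/VisCPM (assumed interface)

# Placeholder path to a downloaded VisCPM-Chat checkpoint.
viscpm_chat = VisCPMChat("/path/to/viscpm_chat_checkpoint.pt")

image = Image.open("example.jpg").convert("RGB")
question = "What is unusual about this image?"

# The chat call is assumed to return the answer plus conversation state for follow-up turns.
answer, context, vision_hidden_states = viscpm_chat.chat(image, question)
print(answer)
```
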
  ## VisCPM-Paint
+ `VisCPM-Paint` supports bilingual text-to-image generation. The model uses `CPM-Bee` as the text encoder and `UNet` as the image decoder, and fuses the vision and language models with a diffusion training objective. During training, the parameters of the language model remain fixed. The visual decoder is initialized from the parameters of [Stable Diffusion 2.1](https://github.com/Stability-AI/stablediffusion) and fused with the language model by gradually unfreezing key bridging parameters. The model is trained on the [LAION 2B](https://laion.ai/) English text-image pair dataset.
+
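Analogously, text-to-image generation is exposed through a paint wrapper in the same repository. The sketch below assumes a `VisCPMPaint` class whose `generate` method takes a prompt and returns a list of PIL images; the class name, arguments, and checkpoint path are assumptions to check against the repository.

```python
from VisCPM import VisCPMPaint  # wrapper from https://github.com/OpenBMB/VisCPM (assumed interface)

# Placeholder path to a downloaded VisCPM-Paint checkpoint.
painter = VisCPMPaint("/path/to/viscpm_paint_checkpoint.pt")

# The bilingual model accepts prompts in either Chinese or English.
images = painter.generate("A watercolor painting of a pagoda by a lake at sunset")
images[0].save("pagoda.png")
```
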
+ Similar to `VisCPM-Chat`, we found that, thanks to the bilingual capability of `CPM-Bee`, `VisCPM-Paint` achieves good Chinese text-to-image generation when trained only on English text-image pairs, surpassing the performance of Chinese open-source models. Incorporating an additional 20M cleaned native Chinese text-image pairs and 120M text-image pairs translated into Chinese further improves its Chinese text-to-image generation ability. We sampled 30,000 images from the standard MSCOCO image-generation test set and computed the commonly used FID (Fréchet Inception Distance) metric to assess the quality of the generated images. We provide two versions of the model, `VisCPM-Paint-balance` and `VisCPM-Paint-zhplus`: the former has balanced ability in English and Chinese, while the latter emphasizes Chinese proficiency. `VisCPM-Paint-balance` is trained only on English text-image pairs, while `VisCPM-Paint-zhplus` incorporates an additional 20M native Chinese text-image pairs and 120M translated Chinese text-image pairs on top of `VisCPM-Paint-balance`.
+
+ <table align="center">
+ <tr>
+ <td align="center" rowspan="2">Model</td>
+ <td align="center" colspan="2">Zero-shot FID↓</td>
+ </tr>
+ <tr>
+ <td align="center">English</td>
+ <td align="center">Chinese</td>
+ </tr>
+ <tr>
+ <td align="center">GLIDE</td>
+ <td align="center">12.2</td>
+ <td align="center">-</td>
+ </tr>
+ <tr>
+ <td align="center">Make-A-Scene</td>
+ <td align="center">11.8</td>
+ <td align="center">-</td>
+ </tr>
+ <tr>
+ <td align="center">DALL·E-2</td>
+ <td align="center">10.4</td>
+ <td align="center">-</td>
+ </tr>
+ <tr>
+ <td align="center">Unidiffuser</td>
+ <td align="center">9.7</td>
+ <td align="center">-</td>
+ </tr>
+ <tr>
+ <td align="center">Cogview2</td>
+ <td align="center">-</td>
+ <td align="center">24.0</td>
+ </tr>
+ <tr>
+ <td align="center">Stable Diffusion</td>
+ <td align="center"><b><span style="color:#c00000;">8.6</span></b></td>
+ <td align="center">-</td>
+ </tr>
+ <tr>
+ <td align="center">AltDiffusion</td>
+ <td align="center">17.2</td>
+ <td align="center">16.1</td>
+ </tr>
+ <tr>
+ <td align="center">TaiyiDiffusion</td>
+ <td align="center">-</td>
+ <td align="center">15.6</td>
+ </tr>
+ <tr>
+ <td align="center">VisCPM-Paint-balance</td>
+ <td align="center">9.5</td>
+ <td align="center">10.9</td>
+ </tr>
+ <tr>
+ <td align="center">VisCPM-Paint-zhplus</td>
+ <td align="center">9.9</td>
+ <td align="center"><b><span style="color:#c00000;">9.6</span></b></td>
+ </tr>
+ </table>
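
For reference, zero-shot FID of the kind reported above compares InceptionV3 feature statistics of generated images against real MSCOCO images. The sketch below shows one common way to compute it with `torchmetrics`; it is not the authors' evaluation pipeline, and the random tensors merely stand in for the 30,000 real and generated images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID over 2048-dimensional InceptionV3 features; default settings expect uint8 images (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)

# Placeholder batches standing in for real MSCOCO images and model-generated images.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated_images, real=False)

print(f"Zero-shot FID: {fid.compute().item():.2f}")
```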
+
+ ## 📝 License
+
+ VisCPM is governed by the [GML License](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md) and permits individual and research use. If you intend to use the model for commercial purposes, please reach out to [email protected] to negotiate commercial licensing.

+ The CPM-Bee base model, governed by the [General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md), permits commercial usage. If you intend to use the model for commercial purposes, please reach out to [email protected] to obtain the certificate of authorization.