Update README.md
README.md
CHANGED
@@ -1056,4 +1056,63 @@ model-index:
---

## piccolo-large-zh

piccolo is a general-purpose Chinese text embedding model, developed by the General Model Group at SenseTime Research. Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collected and crawled 400 million weakly supervised Chinese text pairs from the Internet and trained the model with a pairwise (text, text_pos) softmax contrastive loss. In the second stage, we collected 20 million human-labeled Chinese text pairs and fine-tuned the model with a triplet (text, text_pos, text_neg) softmax contrastive loss with hard negatives, which helps the model optimize further. We currently offer two model sizes: piccolo-base-zh and piccolo-large-zh.
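
The description above corresponds to a standard InfoNCE-style softmax contrastive objective. As a rough sketch only (this is not the authors' released training code; the temperature value and function names are assumptions), the two stages' losses could be written in PyTorch as:

```python
import torch
import torch.nn.functional as F

def pair_contrastive_loss(q, p, temperature=0.05):
    """Stage-1 style loss: in-batch softmax contrastive loss over (text, text_pos)
    pairs. Row i's positive is passage i; all other passages in the batch serve
    as negatives. The temperature is an assumed value, not from the model card."""
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

def triplet_contrastive_loss(q, p, n, temperature=0.05):
    """Stage-2 style loss: same softmax form, but each query additionally gets
    an explicit hard negative appended to its candidate set."""
    q, p, n = (F.normalize(x, dim=-1) for x in (q, p, n))
    in_batch = q @ p.T / temperature                            # (B, B)
    hard_neg = (q * n).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    logits = torch.cat([in_batch, hard_neg], dim=1)             # (B, B + 1)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```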
## Metric

We compared piccolo with other open-source embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard. Scripts for reproducing the results below are provided in the [eval folder](https://huggingface.co/sensenova/piccolo-base-zh/tree/main/eval).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **piccolo-large-zh** | 0.65 | 1024 | 512 | **64.11** | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| bge-large-zh | 1.3 | 1024 | 512 | 63.96 | 68.32 | 48.39 | 78.94 | 65.11 | 71.52 | 54.98 |
| **piccolo-base-zh** | 0.2 | 768 | 512 | **63.66** | 66.98 | 47.12 | 76.61 | 66.68 | 71.20 | 55.90 |
| bge-large-zh-no-instruct | 1.3 | 1024 | 512 | 63.40 | 68.58 | 50.01 | 76.77 | 64.90 | 70.54 | 53.00 |
| bge-base-zh | 0.41 | 768 | 512 | 62.80 | 67.07 | 47.64 | 77.50 | 64.91 | 69.53 | 54.12 |
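
Individual C-MTEB tasks can also be run with the open-source `mteb` package. The snippet below is a minimal sketch under assumptions (the task name and output folder are illustrative, and the package API may differ by version); the eval folder scripts above remain the authoritative recipe:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sensenova/piccolo-base-zh")

# "TNews" is one C-MTEB classification task, used here only as an example.
evaluation = MTEB(tasks=["TNews"])
evaluation.run(model, output_folder="results/piccolo-base-zh")
```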
## Usage

piccolo can be loaded easily through the sentence-transformers package:

```python
from sentence_transformers import SentenceTransformer

# For s2s (short-to-short) datasets, piccolo can be used directly:
sentences = ["数据1", "数据2"]
model = SentenceTransformer('sensenova/piccolo-base-zh')
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For s2p (short-to-long) datasets, we recommend prepending an instruction
# to help the model retrieve passages more effectively:
queries = ['query_1', 'query_2']
passages = ["doc_1", "doc_2"]
q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```
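
As a small follow-up (illustrative only; `queries`, `passages`, and `scores` come from the snippet above), the score matrix can be turned into a per-query ranking:

```python
import numpy as np

# scores has shape (num_queries, num_passages); since the embeddings are
# L2-normalized, each entry is a cosine similarity.
for qi, query in enumerate(queries):
    ranked = np.argsort(-scores[qi])  # passage indices, best match first
    print(query, [passages[pi] for pi in ranked])
```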
## Training Detail
TODO
## Acknowledgement

piccolo is developed by the General Model Group at SenseTime Research.
[Jinkin](https://huggingface.co/Jinkin) completed the code implementation and model training.
[Jinkin](https://huggingface.co/Jinkin) and [CCCCxxx](https://huggingface.co/CCCCxxx) completed the data collection, processing, and model evaluation together.
The project is led by [Gaomengya](https://huggingface.co/gaomengya) and [chaorenwu111](https://huggingface.co/chaorenwu111).
|