{
"cells": [
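{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning Classification\n",
"\n",
"This notebook uses a Random Forest classifier for sentiment classification of text, to show how well embeddings work as text representations.\n",
"\n",
"It demonstrates using the ERNIE (文心) Embedding directly as a text feature encoder and feeding the extracted features to a classical ML model.\n"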
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import time\n",
"from typing import List\n",
"\n",
"import erniebot\n",
"from tqdm import tqdm\n",
"\n",
"erniebot.api_type = 'aistudio'\n",
"erniebot.access_token = '<EB_ACCESS_TOKEN>'\n",
"\n",
"def get_embedding(texts: List[str]) -> List[List[float]]:\n",
"    # Return one embedding vector per input text.\n",
"    if len(texts) <= 16:\n",
"        # Up to 16 texts go out in a single request.\n",
"        embedding = erniebot.Embedding.create(\n",
"            model='ernie-text-embedding',\n",
"            input=texts\n",
"        ).get_result()\n",
"    else:\n",
"        size = len(texts)\n",
"        embedding = []\n",
"        # Send the texts in batches of 16, sleeping 1 s between requests.\n",
"        for i in tqdm(range(math.ceil(size / 16))):\n",
"            batch = texts[i*16:(i+1)*16]\n",
"            embedding.extend(\n",
"                erniebot.Embedding.create(model='ernie-text-embedding', input=batch).get_result()\n",
"            )\n",
"            time.sleep(1)\n",
"    return embedding"
]
},
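{
"cell_type": "markdown",
"metadata": {},
"source": [
"The batching in `get_embedding` can be exercised without network access. The cell below is a minimal sketch, assuming a hypothetical `fake_embed` function in place of `erniebot.Embedding.create`; `chunked` and `embed_in_batches` are illustrative names and not part of erniebot. It shows that slicing by a fixed batch size covers every text exactly once and preserves order.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import Callable, Iterator, List\n",
"\n",
"def chunked(items: List[str], size: int) -> Iterator[List[str]]:\n",
"    # Yield consecutive slices of `items` with at most `size` elements.\n",
"    for start in range(0, len(items), size):\n",
"        yield items[start:start + size]\n",
"\n",
"def embed_in_batches(\n",
"    texts: List[str],\n",
"    embed_fn: Callable[[List[str]], List[List[float]]],\n",
"    batch_size: int = 16,\n",
") -> List[List[float]]:\n",
"    # Call `embed_fn` once per batch and concatenate the results in order.\n",
"    vectors: List[List[float]] = []\n",
"    for batch in chunked(texts, batch_size):\n",
"        vectors.extend(embed_fn(batch))\n",
"    return vectors\n",
"\n",
"def fake_embed(batch: List[str]) -> List[List[float]]:\n",
"    # Hypothetical stand-in for the real API: one 2-d vector per text.\n",
"    return [[float(len(text)), 0.0] for text in batch]\n",
"\n",
"vectors = embed_in_batches([f'review {i}' for i in range(40)], fake_embed)\n",
"print(len(vectors))  # one vector per input text"
]
},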
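{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training data are user reviews collected from a food-delivery platform, taken from the [Chinese NLP Corpus](https://github.com/SophonPlus/ChineseNlpCorpus). 600 positive and 600 negative reviews are selected, of which 100 positive and 100 negative reviews serve as the test set. In the code below the test set is drawn as a random 20% hold-out, so its per-class counts are close to, but not exactly, 100 each."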
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import classification_report, accuracy_score\n",
"df = pd.read_csv('../data/delivery_reviews_1k.csv')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 63/63 [01:48<00:00, 1.72s/it]\n"
]
}
],
"source": [
"# get embedding from ernie-text-embedding\n",
"review_embedding = get_embedding(df.review.to_list())\n",
"# split the embedding into train set and test set\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    review_embedding, df.label, test_size=0.2, random_state=0\n",
")"
]
},
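{
"cell_type": "markdown",
"metadata": {},
"source": [
"`train_test_split` shuffles at random, so the class counts in the test set can drift away from an even split (the recorded run has 102 vs 98). The cell below is a minimal sketch, on synthetic stand-in data, of passing `stratify` so that both splits keep the class ratio of the full data set. `X_demo` and `y_demo` are hypothetical placeholders, not the review embeddings.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(1000, 8))       # hypothetical stand-in for embeddings\n",
"y_demo = np.array([0] * 500 + [1] * 500)  # balanced binary labels\n",
"\n",
"# stratify=y_demo keeps the 50/50 class ratio in both the train and test split\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo\n",
")\n",
"print(np.bincount(y_te))  # per-class counts in the test split"
]
},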
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.85 0.93 0.89 102\n",
" 1 0.92 0.83 0.87 98\n",
"\n",
" accuracy 0.88 200\n",
" macro avg 0.88 0.88 0.88 200\n",
"weighted avg 0.88 0.88 0.88 200\n",
"\n"
]
}
],
"source": [
"# train the random forest classifier and report the results\n",
"clf = RandomForestClassifier(n_estimators=100)\n",
"clf.fit(X_train, y_train)\n",
"preds = clf.predict(X_test)\n",
"probas = clf.predict_proba(X_test)\n",
"\n",
"report = classification_report(y_test, preds)\n",
"print(report)"
]
},
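{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above computes `probas` but reports only the hard predictions. The cell below is a minimal sketch, on synthetic data from `make_classification`, of how the class probabilities relate to `predict` and how `accuracy_score` summarises the result. All `*_demo` names are hypothetical, and a fixed `random_state` is assumed here only to make the sketch repeatable.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# synthetic binary classification data, standing in for the embeddings\n",
"X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)\n",
"\n",
"clf_demo = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"clf_demo.fit(X_tr, y_tr)\n",
"\n",
"probas_demo = clf_demo.predict_proba(X_te)  # shape (n_samples, n_classes)\n",
"preds_demo = clf_demo.predict(X_te)\n",
"\n",
"# predict() picks the class whose averaged tree probability is highest\n",
"from_probas = clf_demo.classes_[np.argmax(probas_demo, axis=1)]\n",
"print(accuracy_score(y_te, preds_demo))"
]
},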
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With an accuracy of 0.88 on the 200 held-out reviews, the model has learned to separate the two classes well."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ernie",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}