File size: 4,358 Bytes
569cdb0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习分类(Machine Learning Classification)\n",
    "\n",
    "此notebook以Random Forest为例 对文本进行情感分类用于展示Embedding作为文本的表征效果结果。\n",
    "\n",
    "主要用于展示直接使用文心Embedding作为text feature encoder进行特征提取,并应用于ML。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "import erniebot,os,time,math\n",
    "from tqdm import tqdm\n",
    "from typing import List\n",
    "\n",
    "erniebot.api_type = 'aistudio'\n",
    "erniebot.access_token = '<EB_ACCESS_TOKEN>'\n",
    "\n",
    "def get_embedding(word: List[str]) -> List[float]:\n",
    "    if len(word) <= 16:\n",
    "        embedding = erniebot.Embedding.create(\n",
    "                                            model = 'ernie-text-embedding',\n",
    "                                            input = word\n",
    "                                            ).get_result()\n",
    "    else:\n",
    "        size = len(word)\n",
    "        embedding = []\n",
    "        for i in tqdm(range(math.ceil(size / 16))):\n",
    "            embedding.extend(erniebot.Embedding.create(model = 'ernie-text-embedding', input = word[i*16:(i+1)*16]).get_result())\n",
    "            time.sleep(1)\n",
    "    return embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练数据来源为某外卖平台收集的用户评价,选取其中600条正向评论以及600条负向评论(其中100条正向评论以及100条负向评论作为测试集),引用自[Chinese NLP Corpus](https://github.com/SophonPlus/ChineseNlpCorpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import classification_report, accuracy_score\n",
    "df = pd.read_csv('../data/delivery_reviews_1k.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 63/63 [01:48<00:00,  1.72s/it]\n"
     ]
    }
   ],
   "source": [
    "# get embedding from ernie-text-embedding\n",
    "review_embedding = get_embedding(df.review.to_list())\n",
    "# split the embedding into train set and test set\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    review_embedding, df.label, test_size=0.2, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.85      0.93      0.89       102\n",
      "           1       0.92      0.83      0.87        98\n",
      "\n",
      "    accuracy                           0.88       200\n",
      "   macro avg       0.88      0.88      0.88       200\n",
      "weighted avg       0.88      0.88      0.88       200\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# train the randomforest classification model and report the result\n",
    "clf = RandomForestClassifier(n_estimators=100)\n",
    "clf.fit(X_train, y_train)\n",
    "preds = clf.predict(X_test)\n",
    "probas = clf.predict_proba(X_test)\n",
    "\n",
    "report = classification_report(y_test, preds)\n",
    "print(report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以看到模型已经学会了很好地区分这些类别。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ernie",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}