# Machine Learning Classification

This notebook uses a Random Forest classifier for sentiment classification of text, as a demonstration of how well embeddings work as text representations.

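It mainly shows how to use the ERNIE (文心) Embedding directly as a text feature encoder for feature extraction, and how to feed the extracted features to a classical ML model.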


In [16]:
import math
import time
from typing import List

import erniebot
from tqdm import tqdm

erniebot.api_type = 'aistudio'
erniebot.access_token = ''  # set your AI Studio access token here

def get_embedding(word: List[str]) -> List[List[float]]:
    # Return one embedding vector per input text, in input order.
    if len(word) <= 16:
        # Up to 16 texts are sent in a single request.
        embedding = erniebot.Embedding.create(
            model = 'ernie-text-embedding',
            input = word
        ).get_result()
    else:
        # Longer lists are sent in batches of 16 texts per request.
        size = len(word)
        embedding = []
        for i in tqdm(range(math.ceil(size / 16))):
            embedding.extend(erniebot.Embedding.create(model = 'ernie-text-embedding', input = word[i*16:(i+1)*16]).get_result())
            time.sleep(1)  # pause between requests
    return embedding
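The batching logic in `get_embedding` can be checked without calling the API. The sketch below is only an illustration and not part of the original notebook: `chunk`, `embed_all`, and `fake_embed` are hypothetical helpers, and `fake_embed` merely stands in for `erniebot.Embedding.create(...).get_result()` by returning one placeholder vector per text.

```python
from typing import Callable, List

BATCH_SIZE = 16  # same batch size as in get_embedding


def chunk(items: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the API call: one 2-d vector per text."""
    return [[float(len(text)), 0.0] for text in batch]


def embed_all(
    texts: List[str],
    embed: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed every text batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in chunk(texts):
        vectors.extend(embed(batch))
    return vectors


texts = [f"review {i}" for i in range(1000)]
batches = chunk(texts)
vectors = embed_all(texts, fake_embed)
print(len(batches), len(batches[-1]), len(vectors))
```

For 1,000 texts this yields 63 batches, the last one holding 8 texts, and one vector per input text.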

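The training data are user reviews collected from a food-delivery platform. 600 positive and 600 negative reviews were selected (100 positive and 100 negative of these serve as the test set). The data come from the [Chinese NLP Corpus](https://github.com/SophonPlus/ChineseNlpCorpus).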

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
df = pd.read_csv('../data/delivery_reviews_1k.csv')
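Before computing embeddings, it can help to confirm the size and class balance of the loaded table. The sketch below uses a tiny in-memory DataFrame with invented rows rather than the real CSV. The column names `review` and `label` are taken from how `df` is used in this notebook, and the variable `toy` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the review table; the rows are invented for illustration.
toy = pd.DataFrame({
    "label": [1, 1, 0, 0, 1, 0],
    "review": ["good", "tasty", "cold", "late", "fast", "wrong order"],
})

# Count how many reviews carry each label.
counts = {int(k): int(v) for k, v in toy["label"].value_counts().items()}
n_rows = len(toy)
print(n_rows, counts)
```

Running the same two lines on `df` shows the actual number of reviews per label.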

In [17]:
# get embeddings from ernie-text-embedding
review_embedding = get_embedding(df.review.to_list())
# split the embeddings into a train set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    review_embedding, df.label, test_size=0.2, random_state=0
)

100%|██████████| 63/63 [01:48<00:00, 1.72s/it]
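A plain random split does not guarantee equal class counts in the test set. If an exactly balanced test set is wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic features and labels, assuming 1,000 samples split evenly between two classes, so it illustrates the option rather than reproducing the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # synthetic stand-in for the embeddings
y = np.array([0] * 500 + [1] * 500)   # synthetic, perfectly balanced labels

# stratify=y keeps the class proportions identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
test_counts = np.bincount(y_te).tolist()
print(len(X_tr), test_counts)
```

With balanced labels, the stratified test set contains 100 samples of each class.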


In [28]:
# train the random forest classifier and report the results
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)

              precision    recall  f1-score   support

           0       0.85      0.93      0.89       102
           1       0.92      0.83      0.87        98

    accuracy                           0.88       200
   macro avg       0.88      0.88      0.88       200
weighted avg       0.88      0.88      0.88       200
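The F1 scores in the report follow from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the rounded values printed above, so the results are approximate. The helper `f1` is defined here for illustration and is not a scikit-learn function.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_class0 = f1(0.85, 0.93)   # precision and recall reported for class 0
f1_class1 = f1(0.92, 0.83)   # precision and recall reported for class 1
macro_f1 = (f1_class0 + f1_class1) / 2  # unweighted mean over the classes

print(round(f1_class0, 2), round(f1_class1, 2), round(macro_f1, 2))
```

The rounded values agree with the f1-score column of the report (0.89, 0.87, and a macro average of 0.88).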



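The model has learned to separate the two classes well: it reaches 0.88 accuracy on the 200 held-out reviews, with F1 scores of 0.89 for class 0 and 0.87 for class 1.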