{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 机器学习分类(Machine Learning Classification)\n", "\n", "此notebook以Random Forest为例 对文本进行情感分类用于展示Embedding作为文本的表征效果结果。\n", "\n", "主要用于展示直接使用文心Embedding作为text feature encoder进行特征提取,并应用于ML。\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import erniebot,os,time,math\n", "from tqdm import tqdm\n", "from typing import List\n", "\n", "erniebot.api_type = 'aistudio'\n", "erniebot.access_token = ''\n", "\n", "def get_embedding(word: List[str]) -> List[float]:\n", " if len(word) <= 16:\n", " embedding = erniebot.Embedding.create(\n", " model = 'ernie-text-embedding',\n", " input = word\n", " ).get_result()\n", " else:\n", " size = len(word)\n", " embedding = []\n", " for i in tqdm(range(math.ceil(size / 16))):\n", " embedding.extend(erniebot.Embedding.create(model = 'ernie-text-embedding', input = word[i*16:(i+1)*16]).get_result())\n", " time.sleep(1)\n", " return embedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "训练数据来源为某外卖平台收集的用户评价,选取其中600条正向评论以及600条负向评论(其中100条正向评论以及100条负向评论作为测试集),引用自[Chinese NLP Corpus](https://github.com/SophonPlus/ChineseNlpCorpus)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import classification_report, accuracy_score\n", "df = pd.read_csv('../data/delivery_reviews_1k.csv')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 63/63 [01:48<00:00, 1.72s/it]\n" ] } ], "source": [ "# get embedding from ernie-text-embedding\n", "review_embedding = get_embedding(df.review.to_list())\n", "# split the embedding into train set and test set\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " review_embedding, df.label, test_size=0.2, random_state=0\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.85 0.93 0.89 102\n", " 1 0.92 0.83 0.87 98\n", "\n", " accuracy 0.88 200\n", " macro avg 0.88 0.88 0.88 200\n", "weighted avg 0.88 0.88 0.88 200\n", "\n" ] } ], "source": [ "# train the randomforest classification model and report the result\n", "clf = RandomForestClassifier(n_estimators=100)\n", "clf.fit(X_train, y_train)\n", "preds = clf.predict(X_test)\n", "probas = clf.predict_proba(X_test)\n", "\n", "report = classification_report(y_test, preds)\n", "print(report)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们可以看到模型已经学会了很好地区分这些类别。" ] } ], "metadata": { "kernelspec": { "display_name": "ernie", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }