{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习分类（Machine Learning Classification）\n",
    "\n",
    "此notebook以Random Forest为例 对文本进行情感分类用于展示Embedding作为文本的表征效果结果。\n",
    "\n",
    "主要用于展示直接使用文心Embedding作为text feature encoder进行特征提取，并应用于ML。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "import erniebot,os,time,math\n",
    "from tqdm import tqdm\n",
    "from typing import List\n",
    "\n",
    "erniebot.api_type = 'aistudio'\n",
    "erniebot.access_token = '<EB_ACCESS_TOKEN>'\n",
    "\n",
    "def get_embedding(word: List[str]) -> List[float]:\n",
    "    if len(word) <= 16:\n",
    "        embedding = erniebot.Embedding.create(\n",
    "                                            model = 'ernie-text-embedding',\n",
    "                                            input = word\n",
    "                                            ).get_result()\n",
    "    else:\n",
    "        size = len(word)\n",
    "        embedding = []\n",
    "        for i in tqdm(range(math.ceil(size / 16))):\n",
    "            embedding.extend(erniebot.Embedding.create(model = 'ernie-text-embedding', input = word[i*16:(i+1)*16]).get_result())\n",
    "            time.sleep(1)\n",
    "    return embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练数据来源为某外卖平台收集的用户评价，选取其中600条正向评论以及600条负向评论（其中100条正向评论以及100条负向评论作为测试集），引用自[Chinese NLP Corpus](https://github.com/SophonPlus/ChineseNlpCorpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import classification_report, accuracy_score\n",
    "df = pd.read_csv('../data/delivery_reviews_1k.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 63/63 [01:48<00:00,  1.72s/it]\n"
     ]
    }
   ],
   "source": [
    "# get embedding from ernie-text-embedding\n",
    "review_embedding = get_embedding(df.review.to_list())\n",
    "# split the embedding into train set and test set\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    review_embedding, df.label, test_size=0.2, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.85      0.93      0.89       102\n",
      "           1       0.92      0.83      0.87        98\n",
      "\n",
      "    accuracy                           0.88       200\n",
      "   macro avg       0.88      0.88      0.88       200\n",
      "weighted avg       0.88      0.88      0.88       200\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# train the randomforest classification model and report the result\n",
    "clf = RandomForestClassifier(n_estimators=100)\n",
    "clf.fit(X_train, y_train)\n",
    "preds = clf.predict(X_test)\n",
    "probas = clf.predict_proba(X_test)\n",
    "\n",
    "report = classification_report(y_test, preds)\n",
    "print(report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以看到模型已经学会了很好地区分这些类别。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ernie",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}