|
--- |
|
title: Sentiment Analysis App |
|
emoji: 🚀 |
|
colorFrom: green |
|
colorTo: purple |
|
sdk: streamlit |
|
sdk_version: 1.17.0 |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# AI Project: Finetuning Language Models - Toxic Tweets |
|
|
|
Hello! This is a project for CS-UY 4613: Artificial Intelligence. I'm providing step-by-step instructions for finetuning language models to detect toxic tweets. All code is well commented.
|
|
|
# Everything you need to know
|
* **HuggingFace app**: https://huggingface.co/spaces/andyqin18/sentiment-analysis-app |
|
|
|
  * Code behind the app: [app.py](app.py)
|
|
|
* **Finetuned model**: https://huggingface.co/andyqin18/finetuned-bert-uncased |
|
|
|
  * Code for finetuning a language model: [finetune.ipynb](milestone3/finetune.ipynb)
|
|
|
* **Demonstration video**: https://youtu.be/d7qUxv-MB2M |
|
|
|
* **Landing page**: https://sites.google.com/nyu.edu/aiprojectbyandyqin/home |
|
|
|
Performance of the model, measured with [test_model.py](test_model.py), is shown below. The results are generated from 2000 randomly selected samples of [train.csv](milestone3/comp/train.csv):
|
|
|
``` |
|
{'label_accuracy': 0.9821666666666666,
 'prediction_accuracy': 0.9195,
 'precision': 0.8263888888888888,
 'recall': 0.719758064516129}
|
``` |
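For reference, here is a sketch of how such multi-label metrics could be computed with scikit-learn, assuming `label_accuracy` is element-wise accuracy over all label positions, `prediction_accuracy` is per-sample exact match, and precision/recall are computed over the flattened binary labels (these interpretations are assumptions, not necessarily the exact logic in [test_model.py](test_model.py)):

```
import numpy as np
from sklearn.metrics import precision_score, recall_score

def evaluate(y_true, y_pred):
    # y_true, y_pred: (n_samples, n_labels) binary arrays
    return {
        'label_accuracy': (y_true == y_pred).mean(),
        'prediction_accuracy': (y_true == y_pred).all(axis=1).mean(),
        'precision': precision_score(y_true.ravel(), y_pred.ravel()),
        'recall': recall_score(y_true.ravel(), y_pred.ravel()),
    }
```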
|
|
|
Now let's walk through the details :) |
|
|
|
# Milestone 1 - Setup |
|
|
|
This milestone includes setting up Docker and creating a development environment on Windows 11.
|
|
|
## 1. Enable WSL2 feature |
|
|
|
The Windows Subsystem for Linux (WSL) lets developers install a Linux distribution on Windows. |
|
|
|
``` |
|
wsl --install |
|
``` |
|
|
|
Ubuntu is the default distribution and WSL2 is the default version.
After creating a Linux username and password, Ubuntu appears in Windows Terminal.
|
Details can be found [here](https://learn.microsoft.com/en-us/windows/wsl/install). |
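To confirm the installed distribution and WSL version afterwards, you can run:

```
wsl --list --verbose
```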
|
|
|
![](milestone1/wsl2.png) |
|
|
|
## 2. Download and install the Linux kernel update package |
|
|
|
The package needs to be downloaded before installing Docker Desktop. |
|
However, this error might occur: |
|
|
|
`Error: wsl_update_x64.msi unable to run because "This update only applies to machines with the Windows Subsystem for Linux"` |
|
|
|
Solution: open Windows Features, enable "Windows Subsystem for Linux", and run the update [package](https://docs.microsoft.com/windows/wsl/wsl2-kernel) again; this time it should succeed.
|
|
|
![](milestone1/kernal_update_sol.png) |
|
|
|
## 3. Download Docker Desktop |
|
|
|
After installing [Docker Desktop](https://www.docker.com/products/docker-desktop/), the WSL2-based engine is enabled automatically.
If not, follow [this link](https://docs.docker.com/desktop/windows/wsl/) for steps to turn on the WSL2 backend.
Open the app and run `docker version` in the Terminal to check that the server is running.
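For example (the `hello-world` image is Docker's standard smoke test):

```
docker version
docker run hello-world
```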
|
|
|
![](milestone1/docker_version.png) |
|
Docker is ready to go. |
|
|
|
## 4. Create project container and image |
|
|
|
First we download the Ubuntu image from Docker’s library with: |
|
``` |
|
docker pull ubuntu |
|
``` |
|
We can check the available images with: |
|
``` |
|
docker image ls |
|
``` |
|
We can create a container named *AI_project* based on the Ubuntu image with:
|
``` |
|
docker run -it --name=AI_project ubuntu |
|
``` |
|
The `-it` options launch the container in interactive mode and attach a terminal interface.
After this, a shell opens and we are dropped into a Linux terminal inside the container.
`root` is the currently logged-in user with the highest privileges, and `249cf37645b4` is the container ID.
|
|
|
![](milestone1/docker_create_container.png) |
|
|
|
## 5. Hello World! |
|
|
|
Now we can work inside the container by installing the Python and pip needed for the project.
First we update and upgrade packages (`apt` is the Advanced Packaging Tool):
|
``` |
|
apt update && apt upgrade |
|
``` |
|
Then we install Python and pip with:
|
``` |
|
apt install python3 pip |
|
``` |
|
We can confirm the installation by checking the current versions of Python and pip.

Then create a *hello_world.py* script under the `root` home directory and run it.
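A minimal sketch of these steps (the script content is illustrative):

```
python3 --version
pip --version

echo 'print("Hello World!")' > /root/hello_world.py
python3 /root/hello_world.py
```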
|
You should see the following in VS Code and the Terminal.
|
|
|
![](milestone1/vscode.png) |
|
![](milestone1/hello_world.png) |
|
|
|
## 6. Commit changes to a new image specifically for the project |
|
|
|
After setting up the container, we can commit the changes to a new project image tagged *milestone1* with:
|
``` |
|
docker commit [CONTAINER] [NEW_IMAGE]:[TAG] |
|
``` |
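For example, committing our container to an image tagged *milestone1* (the image name here is illustrative):

```
docker commit AI_project ai_project:milestone1
```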
|
Now if we check the available images, there should be a new image for the project. If we list all containers, we should be able to identify the one we were working on by its container ID, as shown below.
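A quick check (the output will vary by machine):

```
docker image ls
docker ps -a
```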
|
|
|
![](milestone1/commit_to_new_image.png) |
|
|
|
The Docker Desktop app should match the image list we see on Terminal. |
|
|
|
![](milestone1/app_image_list.png) |
|
|
|
# Milestone 2 - Sentiment Analysis App w/ Pretrained Model |
|
|
|
This milestone includes creating a Streamlit app on HuggingFace for sentiment analysis.
|
|
|
## 1. Space setup |
|
|
|
After creating a HuggingFace account, we can create our app as a space and choose Streamlit as the space SDK. |
|
|
|
![](milestone2/new_HF_space.png) |
|
|
|
Then we can go back to our GitHub repo and create the following files.
In order for the space to run properly, there must be at least three files in the root directory:
[README.md](README.md), [app.py](app.py), and [requirements.txt](requirements.txt).
|
|
|
Make sure the following metadata is at the top of **README.md** so HuggingFace can identify the space configuration:
|
``` |
|
---
title: Sentiment Analysis App
emoji: 🚀
colorFrom: green
colorTo: purple
sdk: streamlit
sdk_version: 1.17.0
app_file: app.py
pinned: false
---
|
``` |
|
|
|
The **app.py** file contains the main code of the app, and **requirements.txt** should list all the libraries the code uses. HuggingFace installs the listed libraries into the virtual environment before running the app.
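For example, a minimal **requirements.txt** for this app might look like this (the exact contents of the repo's file may differ):

```
streamlit
transformers
torch
```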
|
|
|
|
|
## 2. Connect and sync to HuggingFace |
|
|
|
Then we create an access token on HuggingFace and add it as a secret (here named `HF_TOKEN`) in the GitHub repo settings, so the workflow can access the new HuggingFace space.
|
|
|
![](milestone2/HF_token.png) |
|
![](milestone2/github_token.png) |
|
|
|
Next, we need to set up a workflow in GitHub Actions. Click "set up a workflow yourself" and replace all the code in `main.yaml` with the following (replace `HF_USERNAME` and `SPACE_NAME` with your own):
|
|
|
``` |
|
name: Sync to Hugging Face hub
on:
  push:
    branches: [main]

  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push --force https://HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
|
``` |
|
The repo is now connected and synced with the HuggingFace space!
|
|
|
## 3. Create the app |
|
|
|
Modify [app.py](app.py) so that it takes in a text and generates an analysis using one of the provided models. Details are explained in the comments.
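A minimal sketch of such an app, assuming the `transformers` pipeline API (the model names and UI labels here are illustrative, not the exact code in [app.py](app.py)):

```
import streamlit as st
from transformers import pipeline

st.title("Sentiment Analysis App")
text = st.text_input("Enter text to analyze:", "I like this movie!")
model_name = st.selectbox(
    "Choose a model:",
    ["distilbert-base-uncased-finetuned-sst-2-english",
     "cardiffnlp/twitter-roberta-base-sentiment"])

if st.button("Analyze"):
    # Load the chosen model and run sentiment analysis on the input text
    classifier = pipeline("text-classification", model=model_name)
    result = classifier(text)[0]
    st.write(f"Label: {result['label']}, score: {result['score']:.4f}")
```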
|
|
|
The app should look like this: |
|
|
|
![](milestone2/app_UI.png) |
|
|
|
# Milestone 3 - Finetuning Language Models |
|
|
|
In this milestone we finetune our own language model on HuggingFace for sentiment analysis.
|
|
|
Here's the setup block that imports all required modules:
|
``` |
|
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
|
``` |
|
|
|
## 1. Prepare Data |
|
First we extract the comment strings and labels from `train.csv` and split them into training and validation data with an 80/20 ratio. We also create two dictionaries that map labels to integers and back.
|
``` |
|
df = pd.read_csv("milestone3/comp/train.csv")

train_texts = df["comment_text"].values
labels = df.columns[2:]
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
train_labels = df[labels].values

# Randomly select 20000 samples within the data.
# Reseeding with the same seed before each draw makes both calls
# pick the same indices, so texts and labels stay aligned.
np.random.seed(18)
small_train_texts = np.random.choice(train_texts, size=20000, replace=False)

np.random.seed(18)
small_train_labels_idx = np.random.choice(train_labels.shape[0], size=20000, replace=False)
small_train_labels = train_labels[small_train_labels_idx, :]

# Separate training data and validation data (80% / 20%)
train_texts, val_texts, train_labels, val_labels = train_test_split(small_train_texts, small_train_labels, test_size=.2)
|
``` |
|
|
|
## 2. Data Preprocessing |
|
Models like BERT don't take raw text as direct input, but rather `input_ids`, attention masks, etc., so we tokenize the text with a tokenizer. `AutoTokenizer` automatically loads the appropriate tokenizer for the checkpoint on the hub. We then wrap the texts and labels into datasets via a class we define.
|
``` |
|
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __getitem__(self, idx):
        # Tokenize on the fly and return the tensors the Trainer expects
        encodings = tokenizer(self.texts[idx], truncation=True, padding="max_length")
        item = {key: torch.tensor(val) for key, val in encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TextDataset(train_texts, train_labels)
val_dataset = TextDataset(val_texts, val_labels)
|
``` |
|
|
|
## 3. Train the model using Trainer |
|
We define a model on top of the pretrained base and set the problem type to `multi_label_classification`. Then we train the model with `Trainer`, which first requires `TrainingArguments` specifying the training hyperparameters; there we can set the learning rate, batch sizes, and `push_to_hub=True`.
|
|
|
After authenticating with a HuggingFace token, the model is pushed to the hub.
|
|
|
``` |
|
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    problem_type="multi_label_classification",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id)
model.to(device)

training_args = TrainingArguments(
    output_dir="finetuned-bert-uncased",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    load_best_model_at_end=True,
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

trainer.train()
trainer.push_to_hub()
|
``` |
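Once pushed, the finetuned model can be loaded back from the hub for inference. A minimal sketch, assuming sigmoid activations over the multi-label logits (the example text and the 0.5 threshold are illustrative):

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("andyqin18/finetuned-bert-uncased")
model = AutoModelForSequenceClassification.from_pretrained("andyqin18/finetuned-bert-uncased")

inputs = tokenizer("You are a great person!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]  # one probability per label

# Print every label whose probability exceeds the threshold
for idx, p in enumerate(probs.tolist()):
    if p > 0.5:
        print(model.config.id2label[idx], round(p, 4))
```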
|
|
|
## 4. Update the app |
|
|
|
Modify [app.py](app.py) so that it takes in a text and generates an analysis using one of the provided models. If you choose my fine-tuned model, 10 more examples are shown. Details are explained in the comments.
|
|
|
The app should look like this: |
|
|
|
![](milestone3/appUI.png) |
|
|
|
## References
|
For connecting GitHub with HuggingFace, check this [video](https://www.youtube.com/watch?v=8hOzsFETm4I).

For creating the app, check this [video](https://www.youtube.com/watch?v=GSt00_-0ncQ).

The HuggingFace documentation is [here](https://huggingface.co/docs), and the Streamlit API reference is [here](https://docs.streamlit.io/library/api-reference).