# SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction.
## Start:
<br> Open your terminal.
<br> Clone the repository using the following command:
```
git clone https://huggingface.co/CIS5190abcd/svm
```
<br> Go to the svm directory using the following command:
```
cd svm
```
<br> Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If any are missing, run the following commands:
```
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
```
<br> Rerun ```ls``` and double-check that all the required files (```tfidf.py```, ```svm.py``` and ```data_cleaning.py```) exist. The output should look like this:

<br> Stay inside the svm directory until the end of this guide.
## Installation
<br> Before running the code, ensure you have all the required libraries installed:
```
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```
<br> Open a Python interpreter directly from the terminal by typing the following command:
```
python
```
<br> Download the necessary NLTK resources for preprocessing:
```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
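<br> Optionally, you can run a quick check that the resources are available; this is just a sanity check and not part of the original setup:
```python
from nltk.corpus import stopwords

# Print a few English stopwords to confirm the download succeeded
print(stopwords.words('english')[:5])
```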
<br> After downloading all the required resources, **do not** exit the Python interpreter.
## How to use:
To apply the pre-trained SVM model to a new dataset, follow the steps below:
- Clean the Dataset
```python
# Import the data cleaning utility from the repository, plus the libraries it relies on
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```
<br> You can use any dataset you want by changing the file path inside ```pd.read_csv()```.
```python
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```
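For example, a local CSV file works the same way, as long as it has the same columns (e.g. a ```title``` column); the file name below is only a placeholder for your own dataset:
```python
# Hypothetical local file; replace with the path to your own dataset
df = pd.read_csv("my_headlines.csv")
cleaned_df = clean(df)
```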
- Extract TF-IDF Features
```python
from tfidf import tfidf

# Transform the cleaned headlines with the pre-fitted TF-IDF vectorizer
X_new_tfidf = tfidf.transform(cleaned_df['title'])
```
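<br> If you want a quick sanity check (optional, not part of the original steps), you can inspect the shape of the resulting feature matrix:
```python
# Rows = number of headlines, columns = size of the TF-IDF vocabulary
print(X_new_tfidf.shape)
```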
- Make Predictions
```python
from svm import svm_model

# Predict labels for the new headlines (assumes svm_model is a fitted scikit-learn classifier)
predictions = svm_model.predict(X_new_tfidf)
```
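<br> If your cleaned dataset also contains ground-truth labels, you can compare them against the predictions with scikit-learn; the column name ```label``` below is an assumption about your data, so adjust it as needed:
```python
from sklearn.metrics import accuracy_score

# Column name 'label' is assumed; replace it with the label column in your dataset
print(accuracy_score(cleaned_df['label'], predictions))
```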
<br> Run ```exit()``` if you want to leave Python.
<br> Run ```cd ..``` if you want to leave the svm directory.