---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for Random Forest Headline Classifier

NOTE: This is NOT our final model. This is one of the secondary models that we explored while developing our final model. The final model is in the GBTrees repository on HuggingFace.

## Model Details
This model classifies news headlines as coming from either NBC News or Fox News.

### Model Description

- **Developed by:** Jack Bader, Kaiyuan Wang, Pairan Xu
- **Task:** Binary classification (NBC News vs. Fox News)
- **Preprocessing:** TF-IDF vectorization applied to the headline text
  - `stop_words="english"`
  - `max_features=1000`
- **Model type:** Random Forest
- **Framework:** Scikit-learn
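The preprocessing and model choices above can be sketched as a small scikit-learn training pipeline. This is an illustrative reconstruction, not the actual training script: the sample headlines, labels, and split parameters below are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical training data: 'title' is the headline, 'label' the outlet
df = pd.DataFrame({
    "title": ["Senate passes budget bill", "Host slams new policy",
              "Markets rally after jobs report", "Border debate heats up"],
    "label": [0, 1, 0, 1],
})

# TF-IDF configuration matching the card: English stop words, top 1000 terms
tfidf = TfidfVectorizer(stop_words="english", max_features=1000)
X = tfidf.fit_transform(df["title"])

X_train, X_test, y_train, y_test = train_test_split(
    X, df["label"], test_size=0.5, random_state=42)

# Random Forest classifier, as described above
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```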
#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Accuracy Score
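Accuracy here is the fraction of headlines assigned to the correct outlet. With scikit-learn it can be computed directly (the labels below are purely illustrative):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]  # illustrative ground-truth outlet labels
y_pred = [0, 1, 0, 0, 1]  # illustrative predictions
print(accuracy_score(y_true, y_pred))  # 4 of 5 correct -> 0.8
```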

### Model Evaluation
```python
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Mount Google Drive (Colab-specific)
from google.colab import drive
drive.mount('/content/drive')

# Load test set
test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252")

# Log in with a Hugging Face token
# (the token can be found in the repo as Token.docx)
!huggingface-cli login

# Download the pickled model and vectorizer from the Hub
model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest",
                             filename="best_rf_model.pkl")
vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest",
                                  filename="tfidf_vectorizer.pkl")

# Load the fitted model and vectorizer
model = joblib.load(model_path)
tfidf_vectorizer = joblib.load(vectorizer_path)

# Extract the headlines from the test set
X_test = test_df['title']

# Transform the headlines into TF-IDF features
X_test_transformed = tfidf_vectorizer.transform(X_test)

# Predict the outlet for each headline
y_pred = model.predict(X_test_transformed)

# Extract 'label' as the target
y_test = test_df['label']

# Print the classification report (includes accuracy)
print(classification_report(y_test, y_pred))
```