---
tags:
- phishing-detection
- logistic-regression
- tfidf
- sklearn
- datasets
- huggingface
license: mit
---

# Phishing Detection Model using Logistic Regression and TF-IDF

This model is a phishing detection classifier built with scikit-learn. It uses TF-IDF for feature extraction and Logistic Regression for classification, and labels input text as phishing or non-phishing.

## Model Details

- **Framework**: Scikit-learn
- **Feature Extraction**: TF-IDF Vectorizer (top 5000 features)
- **Algorithm**: Logistic Regression
- **Dataset**: [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset) (`combined_reduced` subset)
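
The two components listed above can also be wrapped in a single scikit-learn `Pipeline`, so that vectorization and classification are fitted and applied together. The following is a minimal sketch, not the code used to train this model: the toy messages and labels are made up for illustration, and only `max_features=5000` comes from the details above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = phishing, 0 = non-phishing), for illustration only
texts = [
    "Verify your account now to avoid suspension",
    "Click here to claim your free prize",
    "Meeting moved to 3pm, see you there",
    "Here are the notes from yesterday's lecture",
]
labels = [1, 1, 0, 0]

# TF-IDF features feed directly into the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

# The fitted pipeline accepts raw strings
predictions = pipeline.predict(["Claim your prize now", "See you at the meeting"])
print(predictions)
```

With a pipeline, the vectorizer and classifier travel as one object, which can make it harder to pair a model with the wrong vocabulary by accident.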

## Installation

Before using the model, install the dependencies used by the example below:

```bash
pip install scikit-learn

# Reinstall pyarrow and datasets together so that their versions are compatible
pip uninstall -y pyarrow datasets
pip install pyarrow datasets
```

## How to Use

The example below trains the model and evaluates it on a held-out test set:

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset.
# trust_remote_code=True permits running the dataset repository's loading script.
dataset_reduced = load_dataset(
    "ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True
)

# Convert to a pandas DataFrame
df = dataset_reduced["train"].to_pandas()

# Extract text and labels
text = df["text"].values
labels = df["label"].values

# Hold out 20% of the data for testing
train_text, test_text, train_labels, test_labels = train_test_split(
    text, labels, test_size=0.2, random_state=42
)

# Fit the TF-IDF vectorizer on the training text only, keeping the top 5000 features
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(train_text)

# Transform the text data into numerical features
train_features = vectorizer.transform(train_text)
test_features = vectorizer.transform(test_text)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(train_features, train_labels)

# Make predictions on the test set
predictions = model.predict(test_features)

# Evaluate the model's accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy}")
```
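
Once the vectorizer and model are fitted, new messages must pass through the same `transform` step before prediction, and both objects need to be saved together for later use. The sketch below is one possible way to do this with `joblib`. It assumes a tiny made-up training set in place of the real dataset so that it runs on its own, and the file name `phishing_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the real training data (1 = phishing, 0 = non-phishing)
train_text = [
    "Your account is locked, confirm your password here",
    "You won a gift card, click the link to claim it",
    "Lunch at noon tomorrow?",
    "The quarterly report is attached for review",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(max_features=5000)
train_features = vectorizer.fit_transform(train_text)
model = LogisticRegression()
model.fit(train_features, train_labels)

# Save both objects: the model is only usable with the vectorizer it was trained with
path = os.path.join(tempfile.mkdtemp(), "phishing_model.joblib")
joblib.dump({"vectorizer": vectorizer, "model": model}, path)

# Later: reload and score unseen messages
saved = joblib.load(path)
new_text = ["Confirm your password to unlock your account"]
new_features = saved["vectorizer"].transform(new_text)
predicted_labels = saved["model"].predict(new_features)
# Column 1 of predict_proba is the probability of class 1 (phishing)
phishing_probability = saved["model"].predict_proba(new_features)[:, 1]
print(predicted_labels, phishing_probability)
```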

## Results

- **Accuracy**: The model achieves an accuracy of `{{accuracy}}` on the held-out test set (20% of the data).
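
Accuracy alone does not show how errors split between missed phishing messages and false alarms. As a sketch, precision, recall, and the confusion matrix can be computed with scikit-learn. The labels below are made-up placeholders standing in for `test_labels` and `predictions` from the example above; they are not results of this model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration (1 = phishing, 0 = non-phishing)
test_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision: share of messages flagged as phishing that really are phishing
precision = precision_score(test_labels, predictions)
# Recall: share of actual phishing messages that were flagged
recall = recall_score(test_labels, predictions)
# Rows are true labels, columns are predicted labels, in the order [0, 1]
matrix = confusion_matrix(test_labels, predictions, labels=[0, 1])

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(matrix)
```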

## Dataset

The model is trained and evaluated on the `combined_reduced` subset of [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset). The subset contains phishing and non-phishing text samples, labeled `1` (phishing) and `0` (non-phishing).
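
Because the labels are binary, it can be useful to check the class balance before splitting. The sketch below uses a small made-up DataFrame with the same `text` and `label` columns as the training example. `stratify` is an optional `train_test_split` argument that the training script in this card does not use; it is shown here only as one possible variation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical samples mirroring the dataset's columns (1 = phishing, 0 = non-phishing)
df = pd.DataFrame({
    "text": [f"sample message {i}" for i in range(20)],
    "label": [1] * 5 + [0] * 15,
})

# Count how many samples fall in each class
label_counts = df["label"].value_counts().sort_index()
print(label_counts)

# stratify keeps the class proportions approximately equal in both splits
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
print(test_df["label"].value_counts().sort_index())
```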

## Limitations and Future Work

- Logistic Regression over TF-IDF features is a simple linear model, so it may not capture complex patterns in text as effectively as deep learning models.
- Future versions could incorporate transformer-based models such as BERT.

## License

This project is licensed under the MIT License. You may use, modify, and distribute this model under the terms of that license.

## Acknowledgements

- [Hugging Face Datasets](https://huggingface.co/datasets)
- [Scikit-learn](https://scikit-learn.org/)