shmaisanymostafa / Phishing-Detection

	---
	tags:
	- phishing-detection
	- logistic-regression
	- tfidf
	- sklearn
	- datasets
	- huggingface
	license: mit
	---

	# Phishing Detection Model using Logistic Regression and TF-IDF

	This model is a phishing detection classifier that uses TF-IDF for feature extraction and Logistic Regression for classification. It takes raw text as input and predicts whether the text is a phishing attempt.

	## Model Details

	- Framework: Scikit-learn
	- Feature Extraction: TF-IDF Vectorizer (`max_features=5000`, i.e. the 5000 most frequent terms)
	- Algorithm: Logistic Regression
	- Dataset: [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset) (combined_reduced subset)
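
To make the feature-extraction setting above concrete, here is a minimal sketch of how `max_features` caps the TF-IDF vocabulary. The toy corpus and the value `3` are assumptions chosen for illustration; the model itself uses 5000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the phishing dataset.
corpus = [
    "verify your account now",
    "your account has been suspended, verify now",
    "meeting notes attached for your review",
]

# max_features keeps only the terms with the highest frequency across the
# corpus. The model card uses 5000; a tiny value is used here for illustration.
vectorizer = TfidfVectorizer(max_features=3)
features = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # the retained terms
print(features.shape)                  # (number of documents, number of features)
```

Every document is mapped to a vector with one column per retained term, so the classifier always sees a fixed-width input regardless of text length.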

	## Installation

	Before using the model, install the dependencies. The example below only needs scikit-learn, pandas (for `to_pandas()`), and the Hugging Face `datasets` library together with `pyarrow`:

	```bash
	pip install scikit-learn pandas
	pip install -U pyarrow datasets
	```

	## How to Use

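	The example below trains and evaluates the model. `trust_remote_code=True` lets `datasets` run the loading script that ships with the dataset repository, so it assumes an installed `datasets` release that still supports script-based loading: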

	```python
	from datasets import load_dataset
	from sklearn.feature_extraction.text import TfidfVectorizer
	from sklearn.linear_model import LogisticRegression
	from sklearn.model_selection import train_test_split
	from sklearn.metrics import accuracy_score

	# Load the dataset
	dataset_reduced = load_dataset("ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True)

	# Convert to pandas DataFrame
	df = dataset_reduced['train'].to_pandas()

	# Extract text and labels
	text = df['text'].values
	labels = df['label'].values

	# Split the data into train and test sets
	train_text, test_text, train_labels, test_labels = train_test_split(
	    text, labels, test_size=0.2, random_state=42
	)

	# Create and fit the TF-IDF vectorizer
	vectorizer = TfidfVectorizer(max_features=5000)
	vectorizer.fit(train_text)

	# Transform the text data into numerical features
	train_features = vectorizer.transform(train_text)
	test_features = vectorizer.transform(test_text)

	# Create and train the logistic regression model
	model = LogisticRegression()
	model.fit(train_features, train_labels)

	# Make predictions on the test set
	predictions = model.predict(test_features)

	# Evaluate the model's accuracy
	accuracy = accuracy_score(test_labels, predictions)
	print(f'Accuracy: {accuracy}')
	```
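
The script above stops at evaluation. As a rough sketch of how the same two steps could be used to classify new text, the vectorizer and classifier can be bundled into one scikit-learn pipeline. The toy training samples and the names `classifier` and `new_messages` are hypothetical and only stand in for the real training split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy samples standing in for the real training split
# (1 = phishing, 0 = non-phishing, matching the dataset's label convention).
train_text = [
    "urgent: verify your bank account password now",
    "you won a prize, click this link to claim",
    "your account is suspended, confirm your login details",
    "lunch at noon tomorrow?",
    "attached are the meeting notes from monday",
    "the quarterly report is ready for review",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Bundling the vectorizer and classifier ensures new text is transformed
# with the same fitted vocabulary that was used during training.
classifier = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(),
)
classifier.fit(train_text, train_labels)

new_messages = ["please confirm your password", "see you at the meeting"]
predicted_labels = classifier.predict(new_messages)
# predict_proba columns follow classifier.classes_, so column 1 is class 1.
phishing_probability = classifier.predict_proba(new_messages)[:, 1]

for message, label, prob in zip(new_messages, predicted_labels, phishing_probability):
    print(f"{message!r}: label={label}, P(phishing)={prob:.2f}")
```

With only six training samples the printed probabilities are not meaningful; the point is the call sequence, which stays the same when the pipeline is fitted on the full dataset.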

	## Results

	- Accuracy: `{{accuracy}}` (no value is recorded in this card). Running the training script above prints the accuracy on the held-out 20% test split.
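
Accuracy alone can hide how the errors are split between missed phishing samples and false alarms. The following sketch shows how a confusion matrix and per-class precision and recall could be computed alongside it. The label and prediction lists are hypothetical values used only to demonstrate the calls, not results of this model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions, used only to show the calls;
# in practice pass test_labels and predictions from the training script.
test_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(test_labels, predictions)
# Rows are true classes, columns are predicted classes, in label order [0, 1].
matrix = confusion_matrix(test_labels, predictions, labels=[0, 1])

print(f"Accuracy: {accuracy:.3f}")
print(matrix)
print(classification_report(
    test_labels, predictions, target_names=["non-phishing", "phishing"]
))
```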

	## Dataset

	The dataset used for training and evaluation is the [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset). It contains a variety of phishing and non-phishing samples labeled as `1` (phishing) and `0` (non-phishing).
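
Because the labels are binary, it can be worth checking the class balance before splitting. The sketch below counts the labels and uses the `stratify` argument of `train_test_split` to keep the class ratio similar in both splits. The toy data is hypothetical, and stratification is an optional addition that the original training script does not use.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same label convention
# (1 = phishing, 0 = non-phishing).
text = [f"sample {i}" for i in range(10)]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(Counter(labels))  # class counts before splitting

# stratify=labels keeps the phishing / non-phishing ratio the same in both splits.
train_text, test_text, train_labels, test_labels = train_test_split(
    text, labels, test_size=0.5, random_state=42, stratify=labels
)
print(Counter(train_labels), Counter(test_labels))
```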

	## Limitations and Future Work

	- The model uses Logistic Regression on TF-IDF features, a linear bag-of-words approach that may not capture complex patterns in text as effectively as deep learning models.
	- Future versions could incorporate transformer-based models such as BERT.

	## License

	This project is licensed under the MIT License. Feel free to use, modify, and distribute this model as per the terms of the license.

	## Acknowledgements

	- [Hugging Face Datasets](https://huggingface.co/datasets)
	- [Scikit-learn](https://scikit-learn.org/)

