shmaisanymostafa committed · Commit e6aa4e2 (verified) · 1 Parent(s): b526d89

Update README.md

  1. README.md +106 -3
README.md CHANGED
@@ -1,3 +1,106 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ tags:
3
+ - phishing-detection
4
+ - logistic-regression
5
+ - tfidf
6
+ - sklearn
7
+ - datasets
8
+ - huggingface
9
+ license: mit
10
+ ---
11
+
12
+ # Phishing Detection Model using Logistic Regression and TF-IDF
13
+
14
+ This model is a phishing detection classifier that uses TF-IDF for feature extraction and Logistic Regression for classification. It labels each text sample as phishing or non-phishing.
15
+
16
+ ## Model Details
17
+
18
+ - **Framework**: Scikit-learn
19
+ - **Feature Extraction**: TF-IDF Vectorizer (top 5000 features)
20
+ - **Algorithm**: Logistic Regression
21
+ - **Dataset**: [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset) (combined_reduced subset)
22
+
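As a rough illustration of what the "top 5000 features" setting means, the sketch below fits a `TfidfVectorizer` on a tiny made-up corpus with a much smaller cap. The sentences and the cap of 3 are assumptions chosen for readability and are not taken from the dataset. With `max_features` set, scikit-learn keeps only the terms with the highest term frequency across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only (not from the phishing dataset).
corpus = [
    "verify your account now",
    "your account statement is ready",
    "verify your password to keep your account",
]

# max_features keeps only the terms with the highest corpus-wide term frequency.
# The model card uses 5000; 3 is used here so the effect is easy to see.
vectorizer = TfidfVectorizer(max_features=3)
features = vectorizer.fit_transform(corpus)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)           # the terms that survived the cap
print(features.shape)  # (number of documents, number of kept terms)
```

Every other term is dropped before the classifier sees the data, so the cap trades vocabulary coverage for a smaller, fixed-width feature matrix.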
23
+ ## Installation
24
+
25
+ Before using the model, ensure you have the necessary dependencies installed:
26
+
27
+ ```bash
28
+ # scikit-learn provides TfidfVectorizer and LogisticRegression
29
+ pip install scikit-learn
30
+ # pandas backs Dataset.to_pandas() in the example below
31
+ pip install pandas
32
+ pip install -U pyarrow datasets
33
+ ```
34
+
35
+ ## How to Use
36
+
37
+ Below is an example of how to train and evaluate the model:
38
+
39
+ ```python
40
+ from datasets import load_dataset
41
+ from sklearn.feature_extraction.text import TfidfVectorizer
42
+ from sklearn.linear_model import LogisticRegression
43
+ from sklearn.model_selection import train_test_split
44
+ from sklearn.metrics import accuracy_score
45
+
46
+ # Load the dataset
47
+ dataset_reduced = load_dataset("ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True)
48
+
49
+ # Convert to pandas DataFrame
50
+ df = dataset_reduced['train'].to_pandas()
51
+
52
+ # Extract text and labels
53
+ text = df['text'].values
54
+ labels = df['label'].values
55
+
56
+ # Split the data into train and test sets
57
+ train_text, test_text, train_labels, test_labels = train_test_split(
58
+     text, labels, test_size=0.2, random_state=42
59
+ )
60
+
61
+ # Create and fit the TF-IDF vectorizer
62
+ vectorizer = TfidfVectorizer(max_features=5000)
63
+ vectorizer.fit(train_text)
64
+
65
+ # Transform the text data into numerical features
66
+ train_features = vectorizer.transform(train_text)
67
+ test_features = vectorizer.transform(test_text)
68
+
69
+ # Create and train the logistic regression model
70
+ model = LogisticRegression()
71
+ model.fit(train_features, train_labels)
72
+
73
+ # Make predictions on the test set
74
+ predictions = model.predict(test_features)
75
+
76
+ # Evaluate the model's accuracy
77
+ accuracy = accuracy_score(test_labels, predictions)
78
+ print(f'Accuracy: {accuracy}')
79
+ ```
80
+
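Once a vectorizer and classifier are fitted, new text can be scored the same way. The sketch below is a minimal, self-contained example under stated assumptions: the training sentences and labels are invented toy data, and wrapping both steps in `make_pipeline` is a convenience chosen here, not something the original script does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration; 1 = phishing, 0 = non-phishing.
train_text = [
    "urgent verify your account password now",
    "click here to claim your free prize",
    "your bank account is suspended confirm your login",
    "meeting notes attached for tomorrow",
    "lunch at noon with the project team",
    "quarterly report draft is ready for review",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies TF-IDF and then Logistic Regression in one object,
# so raw strings can be passed directly to fit() and predict().
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
pipeline.fit(train_text, train_labels)

new_messages = [
    "confirm your password to unlock your account",
    "agenda for the team meeting",
]
predicted = pipeline.predict(new_messages)              # one 0/1 label per message
probabilities = pipeline.predict_proba(new_messages)    # columns follow pipeline.classes_

for message, label, proba in zip(new_messages, predicted, probabilities):
    print(f"{label} (p_phishing={proba[1]:.2f}): {message}")
```

A pipeline keeps the fitted vocabulary and the classifier together, which avoids accidentally transforming new text with a different vectorizer than the one used in training.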
81
+ ## Results
82
+
83
+ - **Accuracy**: No measured value is reported yet (the `{{accuracy}}` placeholder is unfilled). Run the example above to compute accuracy on the held-out 20% test split.
84
+
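For reference, the accuracy metric is simply the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand against `accuracy_score`; the labels and predictions are hypothetical values made up for illustration and are not results from this model.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, invented for illustration only.
test_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy by hand: number of matching positions divided by the sample count.
correct = sum(t == p for t, p in zip(test_labels, predictions))
manual_accuracy = correct / len(test_labels)

# The same quantity from scikit-learn.
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

Because accuracy weights every sample equally, it can look favorable on a dataset where one class dominates, so it is worth checking the label balance before interpreting the number.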
85
+ ## Dataset
86
+
87
+ The dataset used for training and evaluation is the [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset). It contains a variety of phishing and non-phishing samples labeled as `1` (phishing) and `0` (non-phishing).
88
+
89
+ ## Limitations and Future Work
90
+
91
+ - The model uses a simple Logistic Regression algorithm, which may not capture complex patterns in text as effectively as deep learning models.
92
+ - Future versions could incorporate more advanced NLP techniques, such as BERT or other transformer-based models.
93
+
94
+ ## License
95
+
96
+ This project is licensed under the MIT License. Feel free to use, modify, and distribute this model as per the terms of the license.
97
+
98
+ ## Acknowledgements
99
+
100
+ - [Hugging Face Datasets](https://huggingface.co/datasets)
101
+ - [Scikit-learn](https://scikit-learn.org/)