---
tags:
- phishing-detection
- logistic-regression
- tfidf
- sklearn
- datasets
- huggingface
license: mit
---

# Phishing Detection Model using Logistic Regression and TF-IDF

This model is a phishing detection classifier that uses TF-IDF for feature extraction and Logistic Regression for classification. It labels each text sample as phishing or non-phishing.

## Model Details

- **Framework**: Scikit-learn
- **Feature Extraction**: TF-IDF Vectorizer (top 5000 features)
- **Algorithm**: Logistic Regression
- **Dataset**: [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset) (combined_reduced subset)

## Installation

Before using the model, install the dependencies:

```bash
pip install scikit-learn
pip install -U "tensorflow-text==2.13.*"
pip install "tf-models-official==2.13.*"
pip uninstall -y pyarrow datasets
pip install pyarrow datasets
```

## How to Use

The example below trains and evaluates the model:

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset (trust_remote_code=True lets the dataset's loading script run)
dataset_reduced = load_dataset("ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True)

# Convert to pandas DataFrame
df = dataset_reduced['train'].to_pandas()

# Extract text and labels
text = df['text'].values
labels = df['label'].values

# Split the data into train and test sets
train_text, test_text, train_labels, test_labels = train_test_split(
    text, labels, test_size=0.2, random_state=42
)

# Create the TF-IDF vectorizer and fit it on the training text only
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(train_text)

# Transform the text data into numerical features
train_features = vectorizer.transform(train_text)
test_features = vectorizer.transform(test_text)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(train_features, train_labels)

# Make predictions on the test set
predictions = model.predict(test_features)

# Evaluate the model's accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy: {accuracy}')
```

## Results

- **Accuracy**: The model achieves an accuracy of `{{accuracy}}` on the test set.

## Dataset

The dataset used for training and evaluation is [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset). It contains phishing and non-phishing samples labeled `1` (phishing) and `0` (non-phishing).

## Limitations and Future Work

- The model uses a simple Logistic Regression algorithm, which may not capture complex patterns in text as effectively as deep learning models.
- Future versions could incorporate transformer-based models such as BERT.

## License

This project is licensed under the MIT License. Feel free to use, modify, and distribute this model under the terms of the license.

## Acknowledgements

- [Hugging Face Datasets](https://huggingface.co/datasets)
- [Scikit-learn](https://scikit-learn.org/)
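
## Example: Classifying New Text

The How to Use section stops at test-set accuracy. As a minimal sketch of the next step, the TF-IDF vectorizer and the Logistic Regression model can be bundled into one scikit-learn pipeline and applied to unseen strings. The toy samples and the names `toy_text`, `toy_labels`, and `new_messages` are invented for illustration and are not taken from the dataset; in practice the pipeline would be fitted on `train_text` and `train_labels` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy samples, invented for illustration only
# (1 = phishing, 0 = non-phishing, matching the dataset's label scheme).
toy_text = [
    "verify your account now or it will be suspended",
    "urgent: confirm your password to claim your prize",
    "click this link to unlock your bank account",
    "your account is locked, verify your password now",
    "meeting notes from the project review are attached",
    "lunch on friday with the team at noon",
    "the quarterly report is ready for your review",
    "thanks for the notes, see you at the meeting",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One object that vectorizes raw text and then classifies it,
# using the same settings as the model card's example.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(),
)
pipeline.fit(toy_text, toy_labels)

# Classify unseen messages directly from raw strings.
new_messages = [
    "please verify your password to unlock your account",
    "notes from the team meeting are attached",
]
predicted = pipeline.predict(new_messages)
probabilities = pipeline.predict_proba(new_messages)

for message, label, proba in zip(new_messages, predicted, probabilities):
    # Columns of predict_proba follow pipeline.classes_, so index 1 is class 1.
    print(f"{message!r} -> label {label}, P(phishing) = {proba[1]:.2f}")
```

Keeping both steps in one pipeline means the vectorizer is always fitted on the same data as the classifier, so new text is transformed with the training vocabulary rather than refitted by accident.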