AventIQ-AI/Securebert-website-phishing-prediction

🔒 SecureBERT Phishing Detection Model

This repository hosts a SecureBERT-based model fine-tuned for phishing URL detection on a cybersecurity dataset. The model classifies each URL as either phishing (malicious) or safe (benign).

📚 Model Details

Model Architecture: SecureBERT (a RoBERTa-based encoder in the BERT family)
Task: Binary Classification (Phishing vs. Safe)
Dataset: shashwatwork/web-page-phishing-detection-dataset (11,431 URLs, 88 features)
Framework: PyTorch & Hugging Face Transformers
Input Data: URL strings & extracted numerical features
Number of Classes: 2 (Phishing, Safe)
Quantization: FP16 (for efficiency)

🚀 Usage

Installation

pip install torch transformers scikit-learn pandas

Loading the Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

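# Load the fine-tuned model and tokenizer from a local directory.
# To load from the Hugging Face Hub instead, use the repository ID
# "AventIQ-AI/Securebert-website-phishing-prediction".
model_path = "./fine_tuned_SecureBERT"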
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set model to evaluation mode

print("✅ SecureBERT model loaded successfully and ready for inference!")

🔍 Perform Phishing Detection

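def predict_url(url: str) -> str:
    """Classify a single URL as "Phishing" or "Safe"."""
    # Tokenize the URL; inputs longer than 512 tokens are truncated
    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Run inference without tracking gradients
    with torch.no_grad():
        output = model(**encoding)

    # Pick the class with the highest logit
    predicted_class = torch.argmax(output.logits, dim=1).item()

    # Map the class index to a label (1 = phishing, 0 = safe)
    label = "Phishing" if predicted_class == 1 else "Safe"
    return label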

# Example usage
custom_url = "http://example.com/free-gift"
prediction = predict_url(custom_url)
print(f"Predicted label: {prediction}")
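
The function above returns only a hard label. When a confidence score or an adjustable decision threshold is useful, the two logits can be converted to a probability with a softmax. The sketch below is a minimal, pure-Python illustration: `phishing_probability`, `classify`, and the `threshold` parameter are hypothetical names that are not part of this repository, and index 1 is assumed to be the phishing class, as in the label mapping above.

```python
import math
from typing import Sequence


def phishing_probability(logits: Sequence[float]) -> float:
    """Softmax probability of class 1 (assumed to be "Phishing")."""
    # Subtract the maximum logit for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    return exps[1] / sum(exps)


def classify(logits: Sequence[float], threshold: float = 0.5) -> str:
    """Label a URL as "Phishing" when its probability reaches the threshold."""
    return "Phishing" if phishing_probability(logits) >= threshold else "Safe"


# Illustrative logits only; real values would come from
# output.logits[0].tolist() inside predict_url.
example_logits = [-1.0, 2.0]
print(f"P(phishing) = {phishing_probability(example_logits):.3f}")
print(classify(example_logits))
print(classify(example_logits, threshold=0.99))
```

With the default threshold of 0.5 this agrees with the argmax rule apart from exact ties. Raising the threshold trades recall for fewer false positives; a suitable value would need to be chosen on validation data.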

📊 Evaluation Results

After fine-tuning, the model was evaluated on a test set, achieving the following performance:

Metric	Score
Accuracy	97.2%
Precision	96.8%
Recall	97.5%
F1-Score	97.1%
Inference Speed	Fast (Optimized with FP16)
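
Because the F1-score is the harmonic mean of precision and recall, the reported figures can be cross-checked against each other. The sketch below does that with the precision and recall from the table; the helper names and the confusion counts in the second example are hypothetical and are not taken from the actual evaluation.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted phishing URLs that really are phishing."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual phishing URLs that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Cross-check using the precision and recall reported in the table.
print(f"F1 from reported precision/recall: {f1_score(0.968, 0.975):.3f}")

# Hypothetical confusion counts, for illustration only.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.3f}")
```

The first value rounds to 0.971, which is consistent with the F1-score reported above.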

🛠️ Fine-Tuning Details

Dataset

The model was trained on the shashwatwork/web-page-phishing-detection-dataset, which consists of 11,431 URLs labeled as either phishing or safe. Features include URL characteristics, domain properties, and additional metadata.

Training Configuration

Number of epochs: 5
Batch size: 16
Optimizer: AdamW
Learning rate: 2e-5
Loss Function: Cross-Entropy
Evaluation Strategy: Validation at each epoch
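
One way to express these settings in code is as keyword arguments for Hugging Face `TrainingArguments`. The sketch below is not the original training script. The `output_dir` value and the helper `total_optimizer_steps` are assumptions, and the size of the training split is not stated, so the step count is only an upper bound based on the full dataset.

```python
import math

# Hyperparameters listed above, keyed by the corresponding
# transformers.TrainingArguments parameter names.
training_config = {
    "output_dir": "./fine_tuned_SecureBERT",  # assumed; matches the path used for loading
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,  # equals the batch size on a single device
    "learning_rate": 2e-5,
    "optim": "adamw_torch",  # AdamW
    "eval_strategy": "epoch",  # named "evaluation_strategy" in older transformers releases
}


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps, assuming no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


# Upper bound: the full dataset size is used because the size of the
# training split is not stated.
steps = total_optimizer_steps(
    11_431,
    training_config["per_device_train_batch_size"],
    training_config["num_train_epochs"],
)
print(f"At most {steps} optimizer steps")
```

With the `transformers` package installed, the dictionary could be unpacked as `TrainingArguments(**training_config)`. Cross-entropy needs no entry, because sequence-classification models in Transformers apply it by default for single-label integer targets.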

Quantization

The model weights were converted to FP16 (half precision), which reduces memory usage and, on hardware with native FP16 support, latency, while maintaining high accuracy.
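
The memory saving from FP16 follows from simple arithmetic: each weight takes 2 bytes instead of 4. The sketch below shows that calculation and one possible way to run the model in half precision. The parameter count is an assumed round figure for a base-sized encoder, not a value from this repository, and `weight_memory_mib` and `load_fp16_model` are hypothetical helper names.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


def load_fp16_model(model_path: str):
    """Load the classifier, using half precision only when a GPU is available."""
    # Imported lazily so the arithmetic helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    if torch.cuda.is_available():
        # Half precision is mainly beneficial on GPUs with native FP16 support.
        model = model.half().to("cuda")
    return model.eval()


# Assumed parameter count, for illustration only.
assumed_params = 125_000_000
for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_memory_mib(assumed_params, dtype):.0f} MiB")
```

When the model is placed on a GPU, the tokenizer output has to be moved to the same device before inference, for example with `encoding.to(model.device)`.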

⚠️ Limitations

Evasion Techniques: Attackers constantly evolve phishing techniques, which may reduce model effectiveness.
Dataset Bias: The model was trained on a single dataset, so it may generalize poorly to URL patterns that dataset does not cover; new phishing tactics may require retraining.
False Positives: Some legitimate but unusual URLs might be classified as phishing.

✅ Use this fine-tuned SecureBERT model for accurate and efficient phishing detection! 🔒🚀