TinyBERT Spam Classifier (Enron)

A compact TinyBERT (4-layer, 312 hidden) model fine-tuned to classify email text as spam or ham.
Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).
Optimized for low false positives by default; adjust the decision threshold if you want higher spam recall.

Labels: ham (0) and spam (1)

✨ Quick Start

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True  # recommended for long emails
)

clf("Congratulations! You won a FREE iPhone. Click here now!")
# [{'label': 'spam', 'score': 0.98}]

Batch inference

texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)

🔎 Intended Use & Limitations

Intended use

Classifying email bodies (and optionally subject+body) as spam vs ham.
Low-latency scenarios where a small model is preferred.

Out of scope / Limitations

Non-English email content may reduce accuracy.
Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).

🧰 How We Preprocessed the Data

Light normalization aimed at keeping semantic content:

Remove long base64-like blobs.
Drop quoted lines starting with > or |.
Optional: concatenate Subject + "\n" + Message when available.
Collapse repeated whitespace.

(You can replicate similar cleaning in your serving pipeline for alignment.)
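
The steps above can be approximated with a short helper. This is a minimal sketch, not the original preprocessing code: the function name `clean_email`, the 80-character blob cutoff, and the choice to collapse newlines along with spaces are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: a "base64-like blob" is a run of 80+ base64-alphabet characters.
# The cutoff used for the published model is not documented.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{80,}")
# Quoted reply lines start with ">" or "|" (leading whitespace allowed).
QUOTED_LINE = re.compile(r"^\s*[>|]")


def clean_email(message: str, subject: Optional[str] = None) -> str:
    """Apply light cleaning to one email and return a single-line string."""
    # Optionally prepend the subject, separated by a newline.
    text = f"{subject}\n{message}" if subject else message
    # Drop quoted lines.
    kept = [line for line in text.splitlines() if not QUOTED_LINE.match(line)]
    text = "\n".join(kept)
    # Replace long base64-like blobs with a space.
    text = BASE64_BLOB.sub(" ", text)
    # Collapse every whitespace run (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()
```

Calling the same helper before inference keeps serving inputs close to what the model saw during training.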

🏋️ Training Details

Base model: huawei-noah/TinyBERT_General_4L_312D
Task: Binary text classification (ham=0, spam=1)
Tokenizer: fast BERT tokenizer (uncased)
Max length: 256 tokens
Optimizer / LR: AdamW; learning rates in the range 2e-5 to 5e-5 (final run: 3e-5)
Batch size: 32
Epochs: 4 (early stopping enabled)
Warmup: 10%
Weight decay: 0.01
Loss: Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
Early stopping metric: eval_f1
Best checkpoint: Selected by evaluation on the validation set.

Trainer script: train/train_tinybert.py (TinyBERT-compatible, with legacy HF support shims).
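
For readers unfamiliar with class weighting, one common way to derive balanced weights from a label distribution is `n_samples / (n_classes * class_count)`. The sketch below uses that heuristic; whether the trainer script uses exactly this formula is an assumption, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_id: weight} so that rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}


# Toy label list (not the real dataset): 6 ham (0) and 2 spam (1).
toy_labels = [0] * 6 + [1] * 2
print(balanced_class_weights(toy_labels))
```

The resulting weights would typically be passed to the cross-entropy loss in class-ID order (ham=0, spam=1).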

📊 Evaluation (Chunked Benchmark Summary)

Metrics below come from a chunked evaluation pass (used for long emails) in which the model sees up to 512 tokens per chunk, with overlap between chunks. The decision threshold was tuned to minimize false positives.

Classification Report

Class	Precision	Recall	F1
ham	0.6875	0.9973	0.8139
spam	0.9954	0.5632	0.7194
macro avg	0.8414	0.7802	0.7666

ROC-AUC: 0.9977

Confusion matrix

[[16500    45]
 [ 7500  9671]]
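
The per-class numbers in the report can be re-derived from this matrix. The sketch below assumes rows are true labels and columns are predicted labels, both in the order ham, spam; that layout is not stated in the source, but it reproduces the reported values.

```python
# Confusion matrix as printed above (assumed: rows = true, columns = predicted).
cm = [[16500, 45],
      [7500, 9671]]


def per_class_metrics(cm, idx):
    """Return (precision, recall, f1) for the class at index idx."""
    tp = cm[idx][idx]
    fp = sum(row[idx] for row in cm) - tp  # predicted as idx, true label differs
    fn = sum(cm[idx]) - tp                 # true label idx, predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for name, idx in (("ham", 0), ("spam", 1)):
    p, r, f = per_class_metrics(cm, idx)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
# ham: precision=0.6875 recall=0.9973 f1=0.8139
# spam: precision=0.9954 recall=0.5632 f1=0.7194
```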

Interpretation: The model is conservative. Only 45 of 16,545 ham emails are flagged as spam, but 7,500 of 17,171 spam emails are missed. If you need to catch more spam, lower the decision threshold (e.g., from 0.5 to 0.35) or retrain with a higher spam class weight or focal loss.
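
Applying a custom threshold means comparing the spam probability to your own cutoff instead of taking the top label. A minimal sketch follows; `label_with_threshold` is a hypothetical helper, and the example scores are hand-written, not real model output.

```python
def label_with_threshold(scores, threshold=0.5):
    """Return "spam" if the spam probability meets the threshold, else "ham".

    `scores` is a list of {"label": ..., "score": ...} dicts covering both
    classes for a single email.
    """
    spam_prob = next(s["score"] for s in scores if s["label"] == "spam")
    return "spam" if spam_prob >= threshold else "ham"


# Hand-written scores for illustration only.
example = [{"label": "ham", "score": 0.60}, {"label": "spam", "score": 0.40}]
print(label_with_threshold(example))                  # ham
print(label_with_threshold(example, threshold=0.35))  # spam
```

With the Quick Start pipeline, calling `clf(text, top_k=None, truncation=True)` should return scores for both labels in recent transformers releases; verify the output shape on your installed version before relying on it.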

🎛️ Threshold & Long-Email Guidance

Threshold: Default is 0.5. For higher spam recall, try 0.35–0.45 and evaluate impact on false positives.
Long emails: For multi-paragraph threads, consider chunking and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.
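
The chunk-and-aggregate idea can be sketched on plain token-ID lists. This is an illustration, not the reference app's code: the 64-token overlap and both function names are assumptions, and the model call that produces per-chunk spam scores is left out.

```python
def chunk_with_overlap(token_ids, chunk_size=512, overlap=64):
    """Split token IDs into windows of chunk_size that share `overlap` tokens.

    Assumption: special tokens ([CLS], [SEP]) are added per chunk afterwards,
    so a real chunk_size would be 510 for a 512-token model limit.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end of the email
    return chunks


def aggregate_spam_scores(chunk_scores, how="max"):
    """Combine per-chunk spam probabilities into one email-level score."""
    if not chunk_scores:
        raise ValueError("chunk_scores must not be empty")
    if how == "max":
        return max(chunk_scores)
    if how == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError(f"unknown aggregation: {how}")
```

Taking the max flags an email if any chunk looks like spam, which tends to raise recall; averaging is more conservative and is closer in spirit to the low-false-positive default.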

🧪 Reproducibility

Environment

Python 3.10/3.11
transformers >= 4.40
datasets >= 2.20
evaluate >= 0.4.2
torch >= 2.1

Training command (example)

python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16

Serving (FastAPI example)

python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache

📁 Files

This repo should include:

config.json
pytorch_model.bin or model.safetensors
tokenizer.json and tokenizer_config.json (or vocab.txt etc.)
README.md (this file)
(Optional) label_mapping.json with {"ham": 0, "spam": 1}

⚖️ License

Model weights & code: MIT
Dataset: Check the original Enron dataset/license terms before redistribution.

🔬 Ethical Considerations & Risks

False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
Non-English or code-mixed content may degrade performance.

🧩 Citation

If you use this model, please cite:

@software{tinybert_enron_spam_2025,
  title        = {TinyBERT Spam Classifier (Enron)},
  author       = {Ing. Daniel Eder},
  year         = {2025},
  url          = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}

And the TinyBERT paper:

@article{jiao2020tinybert,
  title={TinyBERT: Distilling BERT for Natural Language Understanding},
  author={Jiao, Xiaoqi and Yin, Yichun and others},
  journal={Findings of EMNLP},
  year={2020}
}

🛠 Maintainers

Daniel Eder

Notes

For a higher-recall variant, fine-tune with --use_focal_loss or increase the spam class weight, then re-evaluate thresholds.
If you want a PyTorch Lightning or Accelerate training variant, the provided trainer is easy to adapt.
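
For context on the focal-loss option mentioned above, the standard per-example formula is `-weight * (1 - p_true)^gamma * log(p_true)`, where `p_true` is the predicted probability of the correct class. The sketch below is a plain-Python illustration of that formula; how `--use_focal_loss` is implemented in the trainer, and which gamma it uses, is not documented here, so the default of 2.0 is an assumption.

```python
import math


def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for one example, given the probability of its true class."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    # (1 - p_true) ** gamma shrinks the loss of already well-classified examples.
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


# With gamma=0 the formula reduces to (weighted) cross-entropy.
easy, hard = 0.95, 0.30
print(focal_loss(easy, gamma=0.0), focal_loss(easy))
print(focal_loss(hard, gamma=0.0), focal_loss(hard))
```

Because easy examples are down-weighted far more than hard ones, training focuses on emails the model currently gets wrong, such as missed spam.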
