🐟 PII-RANHA: Privacy-Preserving Token Classification Model

Overview

PII-RANHA is a fine-tuned token classification model based on ModernBERT-base from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the ai4privacy/pii-masking-400k dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, email addresses, and more.

This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.

Model Details

Model Architecture

Base Model: answerdotai/ModernBERT-base
Task: Token Classification
Number of Labels: 18 (17 PII categories + "O" for non-PII tokens)

Usage

Installation

To use the model, ensure you have the transformers and datasets libraries installed:

pip install transformers datasets

Inference Example

Here’s how to load and use the model for PII detection:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the model and tokenizer
model_name = "scampion/piiranha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create a token classification pipeline
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Example input
text = "My email is [email protected] and my phone number is 555-123-4567."

# Detect PII
results = pii_pipeline(text)
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")

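Example output (one line per predicted subword token; in the tokenizer's vocabulary, Ġ marks a token that was preceded by a space):

Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445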
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
Entity: ., Label: I-USERNAME, Score: 0.5871
Entity: do, Label: I-USERNAME, Score: 0.5350
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
Entity: -, Label: I-SOCIALNUM, Score: 0.5948
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
Entity: -, Label: I-SOCIALNUM, Score: 0.6151
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
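The pipeline output above is token-level, so a single phone number is spread over several subword predictions. For redaction, adjacent predictions usually need to be merged into character spans first. The following is a minimal sketch of one way to do that, not the author's method. It assumes each prediction dict carries `start` and `end` character offsets (the token-classification pipeline provides these with fast tokenizers). The helper names `merge_spans` and `redact`, the `min_score` threshold, and the sample predictions are all hypothetical.

```python
from typing import Dict, List


def merge_spans(predictions: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Merge adjacent token predictions that share a label into character spans.

    `min_score` is an illustrative threshold, not a value from the model card.
    """
    spans: List[Dict] = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        if pred["score"] < min_score:
            continue  # drop low-confidence tokens
        # Strip a leading "B-" / "I-" prefix so both map to the same label.
        label = pred["entity"].split("-", 1)[-1]
        touches_previous = bool(spans) and pred["start"] <= spans[-1]["end"] + 1
        if touches_previous and spans[-1]["label"] == label:
            spans[-1]["end"] = max(spans[-1]["end"], pred["end"])
        else:
            spans.append({"label": label, "start": pred["start"], "end": pred["end"]})
    return spans


def redact(text: str, spans: List[Dict]) -> str:
    """Replace each span in `text` with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for span in spans:
        pieces.append(text[cursor:span["start"]])
        pieces.append(f"[{span['label']}]")
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical token-level predictions, shaped like the pipeline's output.
example_text = "Call 555-123-4567 now."
example_predictions = [
    {"entity": "I-TELEPHONENUM", "score": 0.91, "start": 5, "end": 8},
    {"entity": "I-TELEPHONENUM", "score": 0.88, "start": 8, "end": 9},
    {"entity": "I-TELEPHONENUM", "score": 0.90, "start": 9, "end": 12},
    {"entity": "I-TELEPHONENUM", "score": 0.87, "start": 12, "end": 13},
    {"entity": "I-TELEPHONENUM", "score": 0.93, "start": 13, "end": 17},
]

redacted = redact(example_text, merge_spans(example_predictions))
print(redacted)  # Call [TELEPHONENUM] now.
```

The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups subword tokens into entities. The manual version is shown here only to make the merging step explicit.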

Training Details

Dataset

The model was trained on the ai4privacy/pii-masking-400k dataset, which contains 400,000 examples of text with annotated PII tokens.

Training Configuration

Batch Size: 32
Learning Rate: 5e-6
Epochs: 4
Optimizer: AdamW
Weight Decay: 0.01
Scheduler: Linear learning rate scheduler
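
As a rough illustration of the configuration above, the sketch below writes the listed hyperparameters as a dictionary keyed by the corresponding `transformers.TrainingArguments` parameter names, and shows how a linear schedule decays the learning rate. The model card does not state a warmup length, the number of devices, or the total step count, so `warmup_steps=0`, the single-device batch size, and `total_steps=1000` are assumptions made only for this example. `linear_lr` is a hypothetical helper, not part of any library.

```python
# Hyperparameters from the model card, keyed by TrainingArguments parameter names.
training_config = {
    "per_device_train_batch_size": 32,  # assumes training on a single device
    "learning_rate": 5e-6,
    "num_train_epochs": 4,
    "optim": "adamw_torch",             # AdamW
    "weight_decay": 0.01,
    "lr_scheduler_type": "linear",
}


def linear_lr(step: int, total_steps: int, base_lr: float = 5e-6,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative step count only; the real value depends on the training split size.
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {linear_lr(step, total_steps):.2e}")
```

With no warmup, the rate starts at the base value of 5e-6, is halved at the midpoint, and reaches zero at the final step.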

Evaluation Metrics

The model was evaluated using the following metrics:

Precision
Recall
F1 Score
Accuracy

Epoch	Training Loss	Validation Loss	Precision	Recall	F1	Accuracy
1	0.026000	0.026693	0.808574	0.845563	0.826655	0.990215
2	0.019300	0.020881	0.849764	0.879042	0.864155	0.992203
3	0.016100	0.019111	0.859251	0.882796	0.870865	0.992912
4	0.012200	0.019017	0.860648	0.888844	0.874519	0.993073
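
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values in the table and compares it with the reported F1. It is a sanity check on the arithmetic only; the function name `f1_score` is introduced here for illustration and says nothing about how the metrics were originally computed.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table above.
rows = [
    (1, 0.808574, 0.845563, 0.826655),
    (2, 0.849764, 0.879042, 0.864155),
    (3, 0.859251, 0.882796, 0.870865),
    (4, 0.860648, 0.888844, 0.874519),
]

for epoch, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"epoch {epoch}: computed F1 = {computed:.6f}, reported F1 = {reported:.6f}")
```

The recomputed values agree with the table to within rounding of the six-decimal inputs.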

License

This model is licensed under the Apache License 2.0 with the Commons Clause. For more details, see the Commons Clause website. For other licensing terms, contact the author.

Author

Name: Sébastien Campion

Email: [email protected]

Date: 2025-01-30

Version: 0.1

Citation

If you use this model in your work, please cite it as follows:

@misc{piiranha2025,
  author = {Sébastien Campion},
  title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
  year = {2025},
  version = {0.1},
  url = {https://huggingface.co/scampion/piiranha},
}

Disclaimer

This model is provided "as-is" without any guarantees of performance or suitability for specific use cases. Always evaluate the model's performance in your specific context before deployment.
