🐟 PII-RANHA: Privacy-Preserving Token Classification Model
Overview
PII-RANHA is a fine-tuned token classification model based on ModernBERT-base from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the ai4privacy/pii-masking-400k dataset and detects 17 PII categories, including account numbers, credit card numbers, and email addresses.
This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.
Model Details
Model Architecture
- Base Model: answerdotai/ModernBERT-base
- Task: Token Classification
- Number of Labels: 18 (17 PII categories + "O" for non-PII tokens)
Usage
Installation
To use the model, ensure you have the transformers and datasets libraries installed:
pip install transformers datasets
Inference Example
Here’s how to load and use the model for PII detection:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Load the model and tokenizer
model_name = "scampion/piiranha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Create a token classification pipeline
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)
# Example input
text = "My email is [email protected] and my phone number is 555-123-4567."
# Detect PII
results = pii_pipeline(text)
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
Example output (entities are raw subword tokens; Ġ marks a leading space):
Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
Entity: ., Label: I-USERNAME, Score: 0.5871
Entity: do, Label: I-USERNAME, Score: 0.5350
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
Entity: -, Label: I-SOCIALNUM, Score: 0.5948
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
Entity: -, Label: I-SOCIALNUM, Score: 0.6151
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
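The output above is per-subword and includes low-confidence predictions. For redaction, one option is to have the pipeline merge subwords into spans (the transformers token-classification pipeline accepts aggregation_strategy="simple" and then reports entity_group, start, and end for each span) and to mask those character ranges. The helper below is a minimal sketch of the masking step only. The names redact and min_score, the 0.5 threshold, and the sample spans are hypothetical and not part of this model card.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is a list of dicts with `start`, `end`, `score` and
    `entity_group` keys, the shape returned by a transformers
    token-classification pipeline with aggregation_strategy="simple".
    Overlapping spans are not handled in this sketch.
    """
    # Drop predictions below the (assumed) confidence threshold.
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


# Hand-written spans for illustration only (not real model output).
sample_text = "Call me at 555-123-4567 today."
sample_entities = [
    {"entity_group": "TELEPHONENUM", "score": 0.91, "start": 11, "end": 23},
    {"entity_group": "USERNAME", "score": 0.20, "start": 24, "end": 29},
]
print(redact(sample_text, sample_entities))
```

With the real model, the entities list would come from calling a pipeline created with aggregation_strategy="simple" on the input text. The threshold should be tuned on your own data, since several scores in the example output are below 0.5.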
Training Details
Dataset
The model was trained on the ai4privacy/pii-masking-400k dataset, which contains 400,000 examples of text with annotated PII tokens.
Training Configuration
- Batch Size: 32
- Learning Rate: 5e-6
- Epochs: 4
- Optimizer: AdamW
- Weight Decay: 0.01
- Scheduler: Linear learning rate scheduler
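To make the schedule concrete, here is a minimal sketch of a linear learning-rate schedule in plain Python, following the usual "linear warmup, then linear decay to zero" shape. The warmup length, the use of gradient accumulation, and the size of the training split are not stated in this card, so warmup_steps=0, one optimizer step per batch, and the 320,000-example figure below are assumptions for illustration. The helpers total_steps and linear_lr are hypothetical names.

```python
import math

BASE_LR = 5e-6   # learning rate from the configuration above
BATCH_SIZE = 32  # batch size from the configuration above
EPOCHS = 4       # epochs from the configuration above


def total_steps(num_train_examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Optimizer steps, assuming one step per batch (no gradient accumulation)."""
    return math.ceil(num_train_examples / batch_size) * epochs


def linear_lr(step, num_steps, base_lr=BASE_LR, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, num_steps - step)
    return base_lr * remaining / max(1, num_steps - warmup_steps)


# Assumed training-set size, for illustration only.
steps = total_steps(num_train_examples=320_000)
for s in (0, steps // 2, steps):
    print(s, linear_lr(s, steps))
```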
Evaluation Metrics
The model was evaluated using the following metrics:
- Precision
- Recall
- F1 Score
- Accuracy
Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|
1 | 0.026000 | 0.026693 | 0.808574 | 0.845563 | 0.826655 | 0.990215 |
2 | 0.019300 | 0.020881 | 0.849764 | 0.879042 | 0.864155 | 0.992203 |
3 | 0.016100 | 0.019111 | 0.859251 | 0.882796 | 0.870865 | 0.992912 |
4 | 0.012200 | 0.019017 | 0.860648 | 0.888844 | 0.874519 | 0.993073 |
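The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values in the table. It introduces no new data, and f1_score is a local helper written for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table above.
rows = [
    (1, 0.808574, 0.845563, 0.826655),
    (2, 0.849764, 0.879042, 0.864155),
    (3, 0.859251, 0.882796, 0.870865),
    (4, 0.860648, 0.888844, 0.874519),
]

for epoch, p, r, reported in rows:
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.6f}, reported = {reported:.6f}")
```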
License
This model is licensed under the Commons Clause Apache License 2.0. For more details, see the Commons Clause website. For other licensing terms, contact the author.
Author
Name: Sébastien Campion
Email: [email protected]
Date: 2025-01-30
Version: 0.1
Citation
If you use this model in your work, please cite it as follows:
@misc{piiranha2025,
author = {Sébastien Campion},
title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
year = {2025},
version = {0.1},
url = {https://huggingface.co/sebastien-campion/piiranha},
}
Disclaimer
This model is provided "as-is" without any guarantees of performance or suitability for specific use cases. Always evaluate the model's performance in your specific context before deployment.