Multilingual PII NER

A multilingual transformer model (xlm-roberta-base) fine-tuned for Named Entity Recognition (NER) to detect and mask Personally Identifiable Information (PII) in text across English, German, Italian, and French.

Model Description

  • Architecture: XLM-RoBERTa Base
  • Task: Named Entity Recognition (NER) for PII detection and masking
  • Languages: English, German, Italian, French
  • Training Data: ai4privacy/open-pii-masking-500k-ai4privacy (CoNLL format; see the loading sketch after this list)
  • License: MIT
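
A minimal sketch of loading the training dataset from the Hugging Face Hub is shown below; the split handling and the exact CoNLL-style preprocessing are assumptions here and are implemented in the GitHub repository's scripts.

from datasets import load_dataset

# Assumption: default split layout; the CoNLL-style conversion used for
# training is handled by the repository's preprocessing scripts.
dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
print(dataset)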

Intended Uses & Limitations

  • Intended use: Detect and mask PII entities in multilingual text for privacy-preserving applications.
  • Not suitable for: Applications that require high recall or precision on rare or ambiguous PII types (e.g., PASSPORTNUM, DRIVERLICENSENUM) without further fine-tuning.

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "Ar86Bat/multilang-pii-ner"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John Doe was born on 12/12/1990 and lives in Berlin."
results = nlp(text)
print(results)
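
The pipeline call above returns entity spans with character offsets (entity_group, start, end). A minimal masking sketch built on that output might look like the following; the helper name mask_pii and the bracketed placeholder format are illustrative, not part of the model.

# Replace each detected PII span with its entity label, working backwards
# through the string so earlier character offsets remain valid.
def mask_pii(text, entities):
    masked = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]
    return masked

print(mask_pii(text, results))
# e.g. "[GIVENNAME] [SURNAME] was born on [DATE] and lives in [CITY]."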

Evaluation Results

  • Overall accuracy: 99.24%
  • Macro F1-score: 0.954
  • Weighted F1-score: 0.992

Entity-level highlights

  • High F1-scores (>0.97) for common entities: AGE, BUILDINGNUM, CITY, DATE, EMAIL, GIVENNAME, STREET, TELEPHONENUM, TIME
  • Excellent performance on EMAIL and DATE (F1 β‰ˆ 0.999)
  • Lower F1-scores for challenging/rare entities: DRIVERLICENSENUM (F1 β‰ˆ 0.85), GENDER (F1 β‰ˆ 0.83), PASSPORTNUM (F1 β‰ˆ 0.88), SURNAME (F1 β‰ˆ 0.85), SEX (F1 β‰ˆ 0.84)

Training & Validation

  • Preprocessing, training, and validation scripts are available in the GitHub repository.
  • All model artifacts and outputs are in the model/ directory.
  • Training hyperparameters (see the sketch after this list):
    • num_train_epochs=2 # Total number of training epochs
    • per_device_train_batch_size=32 # Batch size for training
    • per_device_eval_batch_size=32 # Batch size for evaluation
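
A hedged sketch of how these hyperparameters map onto a transformers TrainingArguments object; only the three values listed above come from this card, and output_dir is an illustrative placeholder.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model",               # assumption: matches the model/ artifacts directory
    num_train_epochs=2,               # total number of training epochs
    per_device_train_batch_size=32,   # batch size for training
    per_device_eval_batch_size=32,    # batch size for evaluation
)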

Citation

If you use this model, please cite the repository:

@misc{ar86bat_multilang_pii_ner_2025,
  author = {Arif Hizlan},
  title = {Multilingual PII NER},
  year = {2025},
  howpublished = {\url{https://huggingface.co/Ar86Bat/multilang-pii-ner}}
}

GitHub Repository

https://github.com/Ar86Bat/multilang-pii-ner

License

MIT License
