|
--- |
|
language: |
|
- ar |
|
- ary |
|
license: mit |
|
base_model: FacebookAI/xlm-roberta-large |
|
tags: |
|
- moroccan |
|
- darija |
|
- arabic |
|
- masked-language-modeling |
|
- xlm-roberta |
|
- natural-language-processing |
|
datasets: |
|
- atlasia/Atlaset |
|
library_name: transformers |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: "أنا كنتكلم الدارجة المغربية <mask> مزيان."
|
--- |
|
|
|
# Model Card for atlasia/XLM-RoBERTa-Morocco |
|
|
|
## Model Description |
|
|
|
XLM-RoBERTa-Morocco is a masked language model fine-tuned specifically for Moroccan Darija (Moroccan Arabic dialect). This model is based on [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) and has been further trained on the comprehensive [Atlaset dataset](https://huggingface.co/datasets/atlasia/Atlaset), a curated collection of Moroccan Darija text. |
|
|
|
## Intended Uses |
|
|
|
This model is designed for: |
|
- Text classification tasks in Moroccan Darija |
|
- Named entity recognition in Moroccan Darija |
|
- Sentiment analysis of Moroccan text |
|
- Question answering systems for Moroccan users |
|
- Building embeddings for Moroccan Darija text (see the sketch after this list)
|
- Serving as a foundation for downstream NLP tasks specific to Moroccan dialect |
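
As a concrete illustration of the embedding use case above, here is a minimal sketch that mean-pools the encoder's last hidden state over non-padding tokens. The pooling strategy and the example sentences are illustrative assumptions, not a method prescribed by this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModel.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# "Is there anything new?" / "No, nothing."
sentences = ["واش كاين شي جديد؟", "لا والو"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state         # (batch, seq, 1024)

# Mean-pool over real (non-padding) tokens only
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                # torch.Size([2, 1024])
```

Other pooling choices (e.g. the `<s>` token representation) are equally valid starting points; which works best depends on the downstream task.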
|
|
|
## Training Details |
|
|
|
- **Base Model**: [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
|
- **Training Data**: [Atlaset dataset](https://huggingface.co/datasets/atlasia/Atlaset) (1.17M examples, 155M tokens) |
|
- **Training Procedure**: Fine-tuning with masked language modeling objective |
|
- **Hyperparameters** (see the sketch below):

  - Batch size: 128

  - Learning rate: 1e-4, selected after a sweep over {1e-4, 5e-5, 1e-5}
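
The full training script is not published in this card. The sketch below shows how a comparable MLM fine-tuning run could be set up with the `transformers` Trainer. The base checkpoint, dataset, batch size, and learning rate come from the details above; the `text` column name, `train` split, 15% masking ratio, sequence length, and epoch count are assumptions for illustration:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Base checkpoint and corpus named in this card
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")
dataset = load_dataset("atlasia/Atlaset")  # assumes a "train" split with a "text" column

def tokenize(batch):
    # 512 tokens is XLM-R's usual maximum; the card does not state the value used
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# 15% masking is the standard MLM default; the actual ratio is not reported
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-roberta-morocco",  # illustrative name
    per_device_train_batch_size=128,   # batch size from the card (may have been global)
    learning_rate=1e-4,                # best value from the reported sweep
    num_train_epochs=3,                # assumption; not stated in the card
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```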
|
|
|
## Performance |
|
|
|
In human evaluations conducted through the [Atlaset-Arena](https://huggingface.co/spaces/atlasia/Atlaset-Arena), this model achieved the highest win rate among all evaluated models:
|
|
|
| Model | Wins | Total Comparisons | Win Rate (%) | |
|
|-------|------|-------------------|--------------| |
|
| atlasia/XLM-RoBERTa-Morocco | 72 | 120 | 60.00 | |
|
| aubmindlab/bert-base-arabertv02 | 63 | 114 | 55.26 | |
|
| SI2M-Lab/DarijaBERT | 55 | 119 | 46.22 | |
|
| FacebookAI/xlm-roberta-large | 51 | 120 | 42.50 | |
|
| google-bert/bert-base-multilingual-cased | 29 | 120 | 24.17 | |
|
|
|
The model's win rate is 17.5 percentage points higher than that of the base XLM-RoBERTa-large model (60.00% vs. 42.50%).
|
|
|
## Limitations |
|
|
|
- While the model performs well on Moroccan Darija, performance may vary across different regional variations within Morocco |
|
- The model may not handle code-switching between Darija and other languages optimally |
|
- Performance on highly technical or specialized domains may be limited by the training data composition |
|
|
|
## Ethical Considerations |
|
|
|
- This model is intended to improve accessibility of NLP technologies for Moroccan Darija speakers |
|
- Users should be aware that the model may reflect biases present in the training data |
|
- The model should be further evaluated before deployment in high-stakes applications |
|
|
|
## How to Use |
|
|
|
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModelForMaskedLM.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# Masked language modeling: XLM-RoBERTa uses <mask> (not [MASK]) as its mask token
# "I speak Moroccan Darija <mask> well."
text = "أنا كنتكلم الدارجة المغربية <mask> مزيان."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top-5 candidates for the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```
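
Equivalently, the `fill-mask` pipeline wraps tokenization, inference, and decoding in a single call:

```python
from transformers import pipeline

# The pipeline resolves the model's mask token and returns scored candidates
fill_mask = pipeline("fill-mask", model="atlasia/XLM-RoBERTa-Morocco")

for prediction in fill_mask("أنا كنتكلم الدارجة المغربية <mask> مزيان."):
    print(prediction["token_str"], round(prediction["score"], 3))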
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{atlasia2025xlm-roberta-morocco, |
|
title={XLM-RoBERTa-Morocco: A Masked Language Model for Moroccan Darija}, |
|
author={Abdelaziz Bounhar and Abdeljalil El Majjodi}, |
|
year={2025}, |
|
howpublished={\url{https://huggingface.co/atlasia/XLM-RoBERTa-Morocco}}, |
|
organization={AtlasIA} |
|
} |
|
``` |
|
|
|
## Acknowledgements |
|
|
|
We thank the Hugging Face team for their support and the vibrant research community behind Moroccan Darija NLP. Special thanks to all contributors to the Atlaset dataset, which made this model possible.
|
|