|
--- |
|
language: |
|
- ar |
|
- ary |
|
license: mit |
|
base_model: FacebookAI/xlm-roberta-large |
|
tags: |
|
- moroccan |
|
- darija |
|
- arabic |
|
- masked-language-modeling |
|
- xlm-roberta |
|
- natural-language-processing |
|
datasets: |
|
- atlasia/Atlaset |
|
library_name: transformers |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: "أنا كنتكلم الدارجة المغربية <mask> مزيان."
|
--- |
|
|
|
# Model Card for atlasia/XLM-RoBERTa-Morocco |
|
|
|
## Model Description |
|
|
|
XLM-RoBERTa-Morocco is a masked language model fine-tuned specifically for Moroccan Darija (Moroccan Arabic dialect). This model is based on [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) and has been further trained on the comprehensive [Atlaset dataset](https://huggingface.co/datasets/atlasia/Atlaset), a curated collection of Moroccan Darija text. |
|
|
|
## Intended Uses |
|
|
|
This model is designed for: |
|
- Text classification tasks in Moroccan Darija |
|
- Named entity recognition in Moroccan Darija |
|
- Sentiment analysis of Moroccan text |
|
- Question answering systems for Moroccan users |
|
- Building embeddings for Moroccan Darija text (see the sketch after this list)
|
- Serving as a foundation for downstream NLP tasks specific to Moroccan dialect |
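
As a concrete illustration of the embedding use case above, here is a minimal sketch that mean-pools the encoder's last hidden state over non-padding tokens. The pooling strategy and the example sentences are illustrative assumptions, not a method prescribed by this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModel.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# "Is there anything new?" / "No, nothing."
sentences = ["واش كاين شي جديد؟", "لا والو"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state         # (batch, seq, 1024)

# Mean-pool over real (non-padding) tokens only
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                # torch.Size([2, 1024])
```

Other pooling choices (e.g. the `<s>` token representation) are equally valid starting points; which works best depends on the downstream task.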
|
|
|
## Training Details |
|
|
|
- **Base Model**: [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
|
- **Training Data**: [Atlaset dataset](https://huggingface.co/datasets/atlasia/Atlaset) (1.17M examples, 155M tokens) |
|
- **Training Procedure**: Fine-tuning with masked language modeling objective |
|
- **Hyperparameters** (see the sketch below):

  - Batch size: 128

  - Learning rate: 1e-4, selected after a sweep over {1e-4, 5e-5, 1e-5}
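
The full training script is not published in this card. The sketch below shows how a comparable MLM fine-tuning run could be set up with the `transformers` Trainer. The base checkpoint, dataset, batch size, and learning rate come from the details above; the `text` column name, `train` split, 15% masking ratio, sequence length, and epoch count are assumptions for illustration:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Base checkpoint and corpus named in this card
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")
dataset = load_dataset("atlasia/Atlaset")  # assumes a "train" split with a "text" column

def tokenize(batch):
    # 512 tokens is XLM-R's usual maximum; the card does not state the value used
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# 15% masking is the standard MLM default; the actual ratio is not reported
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-roberta-morocco",  # illustrative name
    per_device_train_batch_size=128,   # batch size from the card (may have been global)
    learning_rate=1e-4,                # best value from the reported sweep
    num_train_epochs=3,                # assumption; not stated in the card
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```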
|
|
|
## Performance |
|
|
|
In human evaluations conducted through the [Atlaset-Arena](https://huggingface.co/spaces/atlasia/Atlaset-Arena), this model achieved the highest win rate among all evaluated models:
|
|
|
| Model | Wins | Total Comparisons | Win Rate (%) | |
|
|-------|------|-------------------|--------------| |
|
| atlasia/XLM-RoBERTa-Morocco | 72 | 120 | 60.00 | |
|
| aubmindlab/bert-base-arabertv02 | 63 | 114 | 55.26 | |
|
| SI2M-Lab/DarijaBERT | 55 | 119 | 46.22 | |
|
| FacebookAI/xlm-roberta-large | 51 | 120 | 42.50 | |
|
| google-bert/bert-base-multilingual-cased | 29 | 120 | 24.17 | |
|
|
|
The model's win rate is 17.5 percentage points higher than that of the base XLM-RoBERTa-large model (60.00% vs. 42.50%).
|
|
|
## Limitations |
|
|
|
- While the model performs well on Moroccan Darija, performance may vary across different regional variations within Morocco |
|
- The model may not handle code-switching between Darija and other languages optimally |
|
- Performance on highly technical or specialized domains may be limited by the training data composition |
|
|
|
## Ethical Considerations |
|
|
|
- This model is intended to improve accessibility of NLP technologies for Moroccan Darija speakers |
|
- Users should be aware that the model may reflect biases present in the training data |
|
- The model should be further evaluated before deployment in high-stakes applications |
|
|
|
## How to Use |
|
|
|
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("atlasia/XLM-RoBERTa-Morocco")
model = AutoModelForMaskedLM.from_pretrained("atlasia/XLM-RoBERTa-Morocco")

# Masked language modeling: XLM-RoBERTa uses <mask> (not [MASK]) as its mask token
# "I speak Moroccan Darija <mask> well."
text = "أنا كنتكلم الدارجة المغربية <mask> مزيان."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top-5 candidates for the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```
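
Equivalently, the `fill-mask` pipeline wraps tokenization, inference, and decoding in a single call:

```python
from transformers import pipeline

# The pipeline resolves the model's mask token and returns scored candidates
fill_mask = pipeline("fill-mask", model="atlasia/XLM-RoBERTa-Morocco")

for prediction in fill_mask("أنا كنتكلم الدارجة المغربية <mask> مزيان."):
    print(prediction["token_str"], round(prediction["score"], 3))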
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{atlasia2025xlm-roberta-morocco, |
|
title={XLM-RoBERTa-Morocco: A Masked Language Model for Moroccan Darija}, |
|
author={Abdelaziz Bounhar and Abdeljalil El Majjodi}, |
|
year={2025}, |
|
howpublished={\url{https://huggingface.co/atlasia/XLM-RoBERTa-Morocco}}, |
|
organization={AtlasIA} |
|
} |
|
``` |
|
|
|
## Acknowledgements |
|
|
|
We thank the Hugging Face team for their support and the vibrant research community behind Moroccan Darija NLP. Special thanks to all contributors to the Atlaset dataset, which made this model possible.
|
|