---
language:
- ar
metrics:
- perplexity
base_model:
- aubmindlab/bert-base-arabert
pipeline_tag: fill-mask
datasets:
- big_arabic_train
- big_arabic_val
library_name: transformers
tags:
- egyptian-arabic
- fine-tuned
- arabert
license: apache-2.0
---

# EgBERT: Fine-Tuned AraBERT for Egyptian Arabic

## Model Description

EgBERT is a fine-tuned version of the pre-trained AraBERT model, adapted to Egyptian Arabic. It was developed to improve performance on tasks that require understanding Egyptian dialect text, with a focus on Masked Language Modeling (MLM). Fine-tuning used a custom dataset of colloquial Egyptian Arabic, making the model particularly well suited to casual, conversational text.

Key Features:
- Based on **[aubmindlab/bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert)**.
- Fine-tuned specifically for **Egyptian Arabic**.
- Optimized for **Masked Language Modeling (MLM)** tasks.

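
For quick experiments, the model can also be queried through the `fill-mask` pipeline in `transformers`; a minimal sketch using the same checkpoint and example sentence as the usage section below:

```python
from transformers import pipeline

# Fill-mask pipeline on the fine-tuned Egyptian Arabic checkpoint
fill_mask = pipeline("fill-mask", model="noortamerr/EgBERT")

# "Football in Egypt is a [MASK] thing that everyone follows."
for prediction in fill_mask("الكورة في مصر [MASK] حاجة كل الناس بتتابعها."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each result is a dict containing the predicted token (`token_str`), its probability (`score`), and the completed sentence (`sequence`).
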
## Training Details

- **Dataset**:
  - A custom dataset of Egyptian Arabic collected from conversational text sources.
  - Preprocessed to include common colloquial phrases and reduce noise in the data.
- **Training Setup** (see the sketch below):
  - Pre-trained model: `aubmindlab/bert-base-arabert`
  - Fine-tuning performed for 3 epochs with a batch size of 16.
  - Learning rate: 2e-5.
  - MLM probability: 15%.

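
The exact training script is not included in this card; the following is a minimal sketch of an equivalent MLM fine-tuning setup with the hyperparameters listed above, assuming the Egyptian Arabic corpus has already been tokenized into `train_dataset` and `eval_dataset` (placeholder names, not part of this repository):

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the AraBERT base checkpoint
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
model = AutoModelForMaskedLM.from_pretrained("aubmindlab/bert-base-arabert")

# Dynamic masking with the 15% MLM probability used for fine-tuning
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters from the list above: 3 epochs, batch size 16, learning rate 2e-5
training_args = TrainingArguments(
    output_dir="egbert-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized Egyptian Arabic training split (assumed)
    eval_dataset=eval_dataset,    # tokenized validation split (assumed)
    data_collator=data_collator,
)

trainer.train()
```
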
## Evaluation Results

### Model Perplexity

- **Baseline Model**: 36.2377
- **Fine-Tuned Model**: 26.5359

The fine-tuned model outperforms the baseline AraBERT model on perplexity (lower is better), indicating stronger masked language modeling performance on Egyptian Arabic.

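
This card does not state exactly how perplexity was computed; a common convention, assumed in the sketch below, is the exponential of the mean masked-LM loss on the validation split (reusing the hypothetical `trainer` and `eval_dataset` from the fine-tuning sketch above):

```python
import math

# Average cross-entropy loss over masked tokens in the validation split
eval_results = trainer.evaluate(eval_dataset=eval_dataset)

# Perplexity is the exponential of the mean MLM loss; lower is better
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.4f}")
```
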
## How to Use

Here’s an example of how to use EgBERT in your project:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")

# Input text with a masked token
text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."

# Tokenize and locate the position of the [MASK] token
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Decode the top 5 predictions for the [MASK] token
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]

print(f"Predicted words: {predicted_words}")
```
## Citation

```bibtex
@misc{EgBERT,
  author    = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
  title     = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/noortamerr/EgBERT}
}
```