---
language:
- ar
metrics:
- perplexity
base_model:
- aubmindlab/bert-base-arabert
pipeline_tag: fill-mask
datasets:
- big_arabic_train
- big_arabic_val
library_name: transformers
tags:
- egyptian-arabic
- fine-tuned
- arabert
license: apache-2.0
---

# EgBERT: Fine-Tuned AraBERT for Egyptian Arabic

## Model Description

EgBERT is a fine-tuned version of the pre-trained AraBERT model, adapted to Egyptian Arabic. It was developed to improve performance on tasks that require understanding Egyptian dialect text, with a focus on Masked Language Modeling (MLM). Fine-tuning used a custom dataset of colloquial Egyptian Arabic, making the model particularly well suited to casual, conversational text.

Key Features:
- Based on **[aubmindlab/bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert)**.
- Fine-tuned specifically for **Egyptian Arabic**.
- Optimized for **Masked Language Modeling (MLM)** tasks.

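
For quick experiments, the model can also be queried through the `fill-mask` pipeline in `transformers`; a minimal sketch using the same checkpoint and example sentence as the usage section below:

```python
from transformers import pipeline

# Fill-mask pipeline on the fine-tuned Egyptian Arabic checkpoint
fill_mask = pipeline("fill-mask", model="noortamerr/EgBERT")

# "Football in Egypt is a [MASK] thing that everyone follows."
for prediction in fill_mask("الكورة في مصر [MASK] حاجة كل الناس بتتابعها."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each result is a dict containing the predicted token (`token_str`), its probability (`score`), and the completed sentence (`sequence`).
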
## Training Details

- **Dataset**:
  - A custom dataset of Egyptian Arabic collected from conversational text sources.
  - Preprocessed to include common colloquial phrases and reduce noise in the data.
- **Training Setup** (see the sketch below):
  - Pre-trained model: `aubmindlab/bert-base-arabert`
  - Fine-tuning performed for 3 epochs with a batch size of 16.
  - Learning rate: 2e-5.
  - MLM probability: 15%.

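
The exact training script is not included in this card; the following is a minimal sketch of an equivalent MLM fine-tuning setup with the hyperparameters listed above, assuming the Egyptian Arabic corpus has already been tokenized into `train_dataset` and `eval_dataset` (placeholder names, not part of this repository):

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the AraBERT base checkpoint
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
model = AutoModelForMaskedLM.from_pretrained("aubmindlab/bert-base-arabert")

# Dynamic masking with the 15% MLM probability used for fine-tuning
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters from the list above: 3 epochs, batch size 16, learning rate 2e-5
training_args = TrainingArguments(
    output_dir="egbert-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized Egyptian Arabic training split (assumed)
    eval_dataset=eval_dataset,    # tokenized validation split (assumed)
    data_collator=data_collator,
)

trainer.train()
```
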
## Evaluation Results

### Model Perplexity

- **Baseline Model**: 36.2377
- **Fine-Tuned Model**: 26.5359

The fine-tuned model outperforms the baseline AraBERT model on perplexity (lower is better), indicating stronger masked language modeling performance on Egyptian Arabic.

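
This card does not state exactly how perplexity was computed; a common convention, assumed in the sketch below, is the exponential of the mean masked-LM loss on the validation split (reusing the hypothetical `trainer` and `eval_dataset` from the fine-tuning sketch above):

```python
import math

# Average cross-entropy loss over masked tokens in the validation split
eval_results = trainer.evaluate(eval_dataset=eval_dataset)

# Perplexity is the exponential of the mean MLM loss; lower is better
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.4f}")
```
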
## How to Use

Here’s an example of how to use EgBERT in your project:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")

# Input text with a masked token
text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."

# Tokenize and locate the position of the [MASK] token
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Decode the top 5 predictions for the [MASK] token
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]

print(f"Predicted words: {predicted_words}")
```
## Citation

```bibtex
@misc{EgBERT,
  author    = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
  title     = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/noortamerr/EgBERT}
}
```