metadata
language:
  - ar
base_model:
  - BounharAbdelaziz/ModernBERT-Morocco
pipeline_tag: feature-extraction

Morocco-Darija-Sentence-Embedding

A sentence embedding model specifically trained for the Moroccan Darija dialect, built using Sentence Transformers and optimized with MatryoshkaLoss for flexible-dimensional embeddings.

Model Architecture

The model was developed in two stages:

  1. Pre-training a Masked Language Model (MLM) on the AL-Atlas Moroccan Darija Pretraining Dataset
  2. Fine-tuning using Sentence Transformers with a combination of losses:
    • CoSENTLoss
    • MultipleNegativesRankingLoss
    • MatryoshkaLoss with dimensions: [32, 64, 128, 256, 512, 1024]

This architecture allows for flexible-dimensional embeddings while maintaining semantic quality across different dimensionality requirements.
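The loss combination above can be approximated with the sentence-transformers API. The sketch below is illustrative rather than the exact training script: it assumes the pre-trained MLM checkpoint (BounharAbdelaziz/ModernBERT-Morocco, as listed in the metadata) as the starting point and wraps each base loss in MatryoshkaLoss with the listed dimensions.

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import (
    CoSENTLoss,
    MultipleNegativesRankingLoss,
    MatryoshkaLoss,
)

# Start from the pre-trained Darija MLM; a pooling head is added automatically
model = SentenceTransformer("BounharAbdelaziz/ModernBERT-Morocco")

matryoshka_dims = [32, 64, 128, 256, 512, 1024]

# Wrapping each base loss in MatryoshkaLoss trains the model to produce
# useful embeddings at every truncation dimension, not just the full 1024
cosent_loss = MatryoshkaLoss(model, CoSENTLoss(model), matryoshka_dims=matryoshka_dims)
mnr_loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=matryoshka_dims)

CoSENTLoss expects sentence pairs with similarity scores, while MultipleNegativesRankingLoss works on positive pairs and uses the rest of the batch as in-batch negatives; which dataset columns fed which loss is not specified here.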

Training Data

Pre-training Dataset

The initial MLM was trained on the AL-Atlas Moroccan Darija Pretraining Dataset, which includes a comprehensive collection of Moroccan Darija text.
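As a rough illustration of this stage (not the exact recipe), a masked language model can be pre-trained on such a corpus with the Hugging Face Trainer. The dataset path, starting checkpoint (assumed here to be ModernBERT-base), text column name, and training arguments below are all placeholders.

from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Illustrative identifiers -- replace with the actual Hub ids
dataset = load_dataset("path/to/AL-Atlas-darija-pretraining", split="train")
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

def tokenize(batch):
    # Assumes the corpus exposes a "text" column
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Randomly mask tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-morocco-mlm"),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()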

Sentence Embedding Training

The sentence embeddings were trained using the Sentence-Transformers-Morocco-Darija Dataset, specifically curated for semantic similarity tasks in Darija.

Training Hyperparameters

batch_size: 32
learning_rate: 2e-5
epochs: 2
warmup_steps: 0.05
gradient_accumulation_steps: 1
max_gradient_norm: 1.0
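Assuming the sentence-transformers v3 trainer was used, these values would map onto the training arguments roughly as follows; the output directory is illustrative, and the 0.05 warmup value is read here as a fraction of the total training steps.

from sentence_transformers import SentenceTransformerTrainingArguments

# Hypothetical mapping of the hyperparameters above onto training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="morocco-darija-sentence-embedding",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=2,
    warmup_ratio=0.05,               # 0.05 interpreted as a fraction of total steps
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,
)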

Key Features

  • Flexible embedding dimensions (32 to 1024) using MatryoshkaLoss
  • Optimized for Moroccan Darija text
  • Maximum sequence length: 512 tokens
  • Handles common Darija expressions and colloquialisms

Usage

from sentence_transformers import SentenceTransformer
import torch

# Load the model
model = SentenceTransformer('BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1')

# Generate embeddings
text = "شكون هو اللي اخترع..."  # "Who is the one who invented..."
embedding = model.encode(text)

# For a specific dimension (e.g., 256), keep the first 256 values of the
# sentence embedding (possible thanks to MatryoshkaLoss training)
embedding_256 = model.encode(text, convert_to_tensor=True)[:256]
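
To compare sentences at a reduced dimension, the library can also truncate for you via truncate_dim (sentence-transformers >= 2.7). The snippet below is a usage sketch; the two Darija sentences are illustrative examples, not taken from the training data.

from sentence_transformers import SentenceTransformer, util

# Let the library truncate embeddings to 256 dimensions at load time
model_256 = SentenceTransformer(
    'BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1',
    truncate_dim=256,
)

# Illustrative Darija sentences ("The weather is nice today" / "The sun is out today")
sentences = ["الجو زوين اليوم", "الشمس طالعة اليوم"]
embeddings = model_256.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two truncated embeddings
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)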

Model Performance

Details coming soon...

Limitations

  • Performance varies with embedding dimension selection
  • Limited handling of very region-specific Darija variants
  • May not perform optimally on highly technical or formal content
  • Performance degrades when inputs are in Arabizi (Arabic written in Latin script)

Citation

If you use this model in your research, please cite:

@misc{morocco-darija-embedding,
  title={Morocco-Darija-Sentence-Embedding: A Neural Language Model for Moroccan Dialect},
  year={2024},
  author={Bounhar, Abdelaziz and El Majjodi, Abdeljalil},
  howpublished={\url{https://huggingface.co/BounharAbdelaziz/Morocco-Darija-Sentence-Embedding-v0.1}},
}

Contributing

Contributions are always welcome!