---
language_model:
- causal
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
language:
- ar
- ary
model-index:
- name: Darija-LM
  results: []
---

# Darija-LM

This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.

## Model Description

This model is a causal language model based on a GPT-like Transformer architecture. It was trained on a combination of the Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot. Tokenization uses SentencePiece with a BPE algorithm and a vocabulary size of 32000.

**Key Model Details:**

- **Architecture:** GPT-like Transformer
- **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
- **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
- **Parameters:**
  - Embedding Dimension (`n_embd`): 384
  - Number of Heads (`n_head`): 6
  - Number of Layers (`n_layer`): 6
  - Block Size (`block_size`): 256
  - Dropout: 0.2
- **Training Hyperparameters:** [Specify hyperparameters such as learning rate, batch size, optimizer, and number of iterations. **TODO: Fill in details**]

## Intended Uses & Limitations

This model is intended for research purposes, specifically in multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation and for further fine-tuning on downstream applications.

**Limitations:**

- **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
- **Data Bias:** Because it is trained on Wikipedia data, the model may exhibit biases present in that dataset.
- **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
- **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.

## How to Use

You can load and use this model with the Hugging Face `transformers` library. Make sure you have `transformers` and `sentencepiece` installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Duino/Darija-LM"  # or the path to a local copy of the model

# Move both the model and the inputs to the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Example generation:
input_text = "مرحبا بالعالم"  # Example Arabic/Darija input ("Hello, world")
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate text (adjust sampling parameters as needed)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, the SentencePiece model is handled automatically as long as it is correctly configured in the repository.

## Training Details

The model was trained using the following steps:

1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently (see the sketch after this list).
5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.

**[TODO: Add more specific details about the training process: optimizer, learning rate schedule, hardware used, training time, etc.]**
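The memory-mapping step above can be implemented with `numpy.memmap`, so the full token stream never has to fit in RAM. Below is a minimal sketch under the assumption that the tokenized corpus is stored as a flat binary file of `uint16` token IDs; the filename `train.bin` and the batch size are illustrative, not the actual training configuration:

```python
import numpy as np
import torch

block_size = 256  # matches the model's block_size
batch_size = 64   # illustrative value, not the actual training setting

# Memory-map the tokenized corpus: the data stays on disk and pages are
# loaded lazily, so corpora larger than RAM can be sampled efficiently.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch():
    # Sample random starting offsets, then slice out contiguous windows.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack(
        [torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix]
    )
    y = torch.stack(
        [torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix]
    )
    return x, y  # inputs and next-token targets, shifted by one position

x, y = get_batch()
print(x.shape, y.shape)  # torch.Size([64, 256]) torch.Size([64, 256])
```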
## Evaluation

**[TODO: Include evaluation metrics if you have them. It is highly recommended to evaluate the model and add metrics here; for example, perplexity on a held-out validation set.]**

- [Metrics and results on a validation set or benchmark.]

## Citation

**[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**

## Model Card Contact

**[TODO: Add contact information so people can reach out with questions or feedback.]**

- [Your name/organization]
- [Your email/website/Hugging Face profile]
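As a starting point for the evaluation TODO above, here is a minimal sketch of computing held-out perplexity with the `transformers` API; the validation texts are placeholders, and this is not an official evaluation setup for this model:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

# Placeholder validation texts; replace with a real held-out split.
texts = ["مرحبا بالعالم"]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        ids = tokenizer.encode(text, return_tensors="pt").to(device)
        # With labels=input_ids, the model returns the mean next-token
        # cross-entropy over the sequence.
        loss = model(ids, labels=ids).loss
        n = ids.numel() - 1  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```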