# Darija-LM
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.
## Model Description
This model is a causal language model based on a GPT-like Transformer architecture. It was trained on a combination of the Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot, and it uses a SentencePiece BPE tokenizer with a vocabulary size of 32000.
**Key Model Details:**
- Architecture: GPT-like Transformer
- Training Data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocab size: 32000)
- Parameters (see the configuration sketch after this list):
  - Embedding Dimension (`n_embd`): 384
  - Number of Heads (`n_head`): 6
  - Number of Layers (`n_layer`): 6
  - Block Size (`block_size`): 256
  - Dropout: 0.2
- Training Hyperparameters: [Specify hyperparameters like learning rate, batch size, optimizer, iterations etc. TODO: Fill in details]
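For reference, the architecture hyperparameters above can be gathered into a small configuration object. The sketch below is purely illustrative (the class name and field layout are assumptions, not taken from the training code); it simply restates the values listed above.

```python
from dataclasses import dataclass

# Illustrative only: a container for the hyperparameters listed above.
# The class name and structure are assumptions, not the actual training code.
@dataclass
class DarijaLMConfig:
    vocab_size: int = 32000   # SentencePiece BPE vocabulary
    n_embd: int = 384         # embedding dimension
    n_head: int = 6           # attention heads per layer
    n_layer: int = 6          # transformer blocks
    block_size: int = 256     # maximum context length in tokens
    dropout: float = 0.2

print(DarijaLMConfig())
```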
## Intended Uses & Limitations
This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and further fine-tuning on downstream applications.
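As an illustration of the fine-tuning route, the sketch below uses the standard `transformers` Trainer API on a plain-text corpus. It is a minimal, hedged example: the corpus file name (`darija_corpus.txt`), batch size, and learning rate are placeholder assumptions, not recommended settings.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Duino/Darija-LM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The collator needs a padding token; fall back to the EOS token if none is set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: one document per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "darija_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="darija-lm-finetuned",
        per_device_train_batch_size=8,   # placeholder value
        num_train_epochs=1,
        learning_rate=3e-4,              # placeholder value
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```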
**Limitations:**
- Research Use Only: This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
- Data Bias: As it is trained on Wikipedia data, the model may exhibit biases present in the dataset.
- Generation Quality: The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
- Language Coverage: While trained on Arabic and Darija, its performance on other languages is not guaranteed.
## How to Use
You can load and use this model with the Hugging Face `transformers` library. Make sure you have `transformers`, `sentencepiece`, and `torch` installed (for example, via `pip install transformers sentencepiece torch`).
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or the path to a local copy of the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model and inputs to the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example Arabic/Darija input ("Hello, world")
input_text = "مرحبا بالعالم"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate text (adjust the sampling parameters as needed)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
Note on Tokenizer: This model uses a SentencePiece tokenizer. When loading with `transformers`, the SentencePiece model should be handled automatically if it is correctly configured in the repository.
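If the automatic tokenizer loading does not work in your environment, the SentencePiece model can also be used directly through the `sentencepiece` package. This is a fallback sketch; the file name `tokenizer.model` is an assumption and should be replaced with whatever SentencePiece file the repository actually contains.

```python
import sentencepiece as spm

# Assumed file name; point this at the SentencePiece model file from the repo.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

ids = sp.encode("مرحبا بالعالم", out_type=int)  # text -> token ids
print(ids)
print(sp.decode(ids))                           # token ids -> text
```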
## Training Details
The model was trained using the following steps:
- Data Streaming and Preprocessing: The Arabic and Darija Wikipedia datasets were streamed with the `datasets` library and preprocessed (see the sketch after this list).
- SentencePiece Tokenization: A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
- Model Training: A GPT-like Transformer model was trained from scratch using PyTorch.
- Memory Optimization: Memory mapping was used to handle large datasets efficiently.
- Robust Download: Implemented retry mechanisms for robust dataset downloading.
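The sketch below illustrates what the streaming and tokenizer-training steps could look like. It is not the actual training script: the dataset configuration names (`20231101.ar` for Arabic, `20231101.ary` for Moroccan Arabic), the sample size, and the file names are assumptions.

```python
from datasets import load_dataset
import sentencepiece as spm

# Stream the Wikipedia snapshots so they are not fully materialized in memory.
ar_wiki = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
ary_wiki = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train", streaming=True)

# Dump a text sample of the Arabic stream to disk as SentencePiece training input.
# (The Darija stream would be tokenized and used for model training the same way.)
with open("spm_sample.txt", "w", encoding="utf-8") as f:
    for i, article in enumerate(ar_wiki):
        f.write(article["text"].replace("\n", " ") + "\n")
        if i >= 50_000:  # illustrative sample size
            break

# Train a BPE SentencePiece model with the vocabulary size used by this model.
spm.SentencePieceTrainer.train(
    input="spm_sample.txt",
    model_prefix="darija_spm",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)
```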
[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time etc.]
## Evaluation
[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]
- [Metrics and results on a validation set or benchmark.]
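As a starting point for the perplexity evaluation suggested above, here is a minimal sketch assuming the checkpoint loads through `AutoModelForCausalLM` as shown earlier; the validation texts are placeholders, not a real benchmark.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(texts, block_size=256):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer.encode(text, return_tensors="pt")
        # Evaluate in block_size chunks to respect the model's context window.
        for start in range(0, ids.size(1), block_size):
            chunk = ids[:, start : start + block_size]
            if chunk.size(1) < 2:
                continue
            out = model(input_ids=chunk, labels=chunk)
            n = chunk.size(1) - 1  # number of predicted positions
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

# Placeholder held-out texts; replace with a real validation set.
val_texts = ["مرحبا بالعالم"]
print("perplexity:", perplexity(val_texts))
```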
## Citation
[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]
## Model Card Contact
[TODO: Add your contact information so people can reach out with questions or feedback.]
- [Your name/organization]
- [Your email/website/Hugging Face profile]