|
--- |
|
language:

- ar

- ary
|
license: apache-2.0 |
|
tags: |
|
- multilingual |
|
- arabic |
|
- darija |
|
- transformers |
|
- text-generation |
|
model-index: |
|
- name: Darija-LM |
|
results: [] |
|
--- |
|
|
|
# Darija-LM |
|
|
|
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets. |
|
|
|
## Model Description |
|
|
|
This model is a causal language model based on a GPT-like Transformer architecture. It was trained on a combination of the Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot, and uses a SentencePiece BPE tokenizer with a vocabulary size of 32,000.
|
|
|
**Key Model Details:** |
|
- **Architecture:** GPT-like Transformer |
|
- **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot) |
|
- **Tokenizer:** SentencePiece (BPE, vocab size: 32000) |
|
- **Parameters** (see the sketch after this list):
|
- Embedding Dimension (`n_embd`): 384 |
|
- Number of Heads (`n_head`): 6 |
|
- Number of Layers (`n_layer`): 6 |
|
- Block Size (`block_size`): 256 |
|
- Dropout: 0.2 |
|
- **Training Hyperparameters:** [Specify hyperparameters like learning rate, batch size, optimizer, iterations etc. **TODO: Fill in details**] |
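
For reference, the hyperparameters above describe a small GPT-style decoder. The sketch below is illustrative only (the `DarijaLMConfig` dataclass and `approx_param_count` helper are hypothetical names, not the actual training code) and shows roughly how many parameters these settings imply:

```python
from dataclasses import dataclass

@dataclass
class DarijaLMConfig:
    # Values taken from the "Key Model Details" list above.
    vocab_size: int = 32000
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    block_size: int = 256
    dropout: float = 0.2

def approx_param_count(cfg: DarijaLMConfig) -> int:
    """Rough parameter count for a GPT-like decoder (ignores biases and LayerNorms)."""
    embeddings = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
    attention = 4 * cfg.n_embd * cfg.n_embd      # Q, K, V and output projections
    mlp = 2 * 4 * cfg.n_embd * cfg.n_embd        # 4x-wide feed-forward, up and down projections
    return embeddings + cfg.n_layer * (attention + mlp)

print(f"~{approx_param_count(DarijaLMConfig()) / 1e6:.1f}M parameters")  # ~23.0M with these settings
```

The exact count depends on implementation details such as bias terms and whether the input and output embeddings are tied.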
|
|
|
## Intended Uses & Limitations |
|
|
|
This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and further fine-tuning on downstream applications. |
|
|
|
**Limitations:** |
|
- **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments. |
|
- **Data Bias:** As it is trained on Wikipedia data, the model may exhibit biases present in the dataset. |
|
- **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases. |
|
- **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed. |
|
|
|
## How to Use |
|
|
|
You can load and use this model with the Hugging Face `transformers` library. Make sure you have `transformers`, `sentencepiece`, and `torch` installed.
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # sentencepiece must be installed for the tokenizer

model_name = "Duino/Darija-LM"  # or a local path to the saved model
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

# Example generation:
input_text = "مرحبا بالعالم"  # "Hello, world" in Arabic
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate text (adjust the sampling parameters as needed)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
|
|
|
**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository. |
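
As a quick sanity check that the SentencePiece tokenizer is wired up correctly, you can round-trip a short Darija string (illustrative snippet; the example sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Duino/Darija-LM")

sample = "السلام، كيف داير؟"  # Darija: "Hi, how are you doing?"
ids = tokenizer.encode(sample)
print(ids)                                              # token ids from the SentencePiece BPE model
print(tokenizer.convert_ids_to_tokens(ids))             # the underlying subword pieces
print(tokenizer.decode(ids, skip_special_tokens=True))  # should round-trip back to the original text
```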
|
|
|
## Training Details |
|
|
|
The model was trained using the following steps: |
|
1. **Data Streaming and Preprocessing:** The Arabic and Darija Wikipedia datasets were streamed with the `datasets` library and preprocessed (illustrative sketches follow this list).
|
2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data. |
|
3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch. |
|
4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently. |
|
5. **Robust Download:** Retry mechanisms were implemented so that dataset downloads recover from transient failures.
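
The sketches below illustrate how steps 1, 2, 4, and 5 could look. They are simplified reconstructions rather than the exact training code: the dataset configuration names, file names, and sample sizes are assumptions. First, streaming the two Wikipedia dumps with a small retry wrapper (assuming the `wikimedia/wikipedia` dataset with the `20231101.ar` and `20231101.ary` configurations):

```python
import time
from datasets import load_dataset

def load_with_retries(config_name, retries=5, wait_seconds=30):
    """Retry transient network failures when opening a streaming Wikipedia split."""
    for attempt in range(1, retries + 1):
        try:
            return load_dataset("wikimedia/wikipedia", config_name,
                                split="train", streaming=True)
        except Exception as err:
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)

arabic = load_with_retries("20231101.ar")   # Modern Standard Arabic Wikipedia
darija = load_with_retries("20231101.ary")  # Moroccan Arabic (Darija) Wikipedia

for article in arabic.take(1):              # streaming datasets support .take()
    print(article["text"][:200])
```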
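
A SentencePiece BPE tokenizer with a 32,000-token vocabulary could then be trained on a sample of the streamed Arabic text (the sample size and file names below are placeholders; `arabic` refers to the streaming dataset from the previous sketch):

```python
import sentencepiece as spm

# Write a sample of streamed articles to a plain-text file, one article per line.
with open("wiki_sample.txt", "w", encoding="utf-8") as f:
    for article in arabic.take(50_000):
        f.write(article["text"].replace("\n", " ") + "\n")

spm.SentencePieceTrainer.train(
    input="wiki_sample.txt",
    model_prefix="darija_lm_sp",   # produces darija_lm_sp.model / darija_lm_sp.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,     # keep most Arabic-script characters
)

sp = spm.SentencePieceProcessor(model_file="darija_lm_sp.model")
print(sp.encode("مرحبا بالعالم", out_type=str))
```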
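
For the memory-mapping step, a common pattern is to write the token ids once to a binary file on disk and then sample training batches without loading the whole corpus into RAM. The sketch below follows that pattern and is again illustrative, reusing `arabic` and `sp` from the previous sketches:

```python
import numpy as np
import torch

block_size = 256

# One-time encoding pass: uint16 is enough because the 32,000-token vocabulary fits in 16 bits.
with open("train.bin", "wb") as f:
    for article in arabic.take(100_000):
        ids = sp.encode(article["text"])
        f.write(np.array(ids, dtype=np.uint16).tobytes())

# Training-time access: memory-map the file instead of reading it into memory.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size=64):
    """Sample random contiguous windows as (input, target) pairs shifted by one token."""
    starts = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in starts])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in starts])
    return x, y

xb, yb = get_batch()
print(xb.shape, yb.shape)  # torch.Size([64, 256]) torch.Size([64, 256])
```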
|
|
|
**[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time etc.]** |
|
|
|
## Evaluation |
|
|
|
**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]** |
|
- [Metrics and results on a validation set or benchmark.] |
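
As one concrete option, perplexity on a held-out set of raw texts could be computed as below (a sketch only; it assumes the model and tokenizer load as in the usage example above, and the `texts` argument is a hypothetical list of validation strings):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cpu", max_length=256):
    """Average per-token perplexity over a list of raw validation strings."""
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer.encode(text, return_tensors="pt",
                               truncation=True, max_length=max_length).to(device)
        if ids.size(1) < 2:
            continue
        out = model(ids, labels=ids)   # loss is the mean cross-entropy over shifted tokens
        n = ids.size(1) - 1
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)

# print(perplexity(model, tokenizer, validation_texts))
```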
|
|
|
## Citation |
|
|
|
**[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]** |
|
|
|
## Model Card Contact |
|
|
|
**[TODO: Add your contact information so people can reach out with questions or feedback.]** |
|
- [Your name/organization] |
|
- [Your email/website/Hugging Face profile] |
|
|