---
language_model:
- causal
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
language:
- ar
- ary
model-index:
- name: Darija-LM
results: []
---
# Darija-LM
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.
## Model Description
This model is a causal language model based on a GPT-like Transformer architecture. It was trained on a combination of Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot, and it uses a SentencePiece tokenizer (BPE algorithm) with a vocabulary size of 32,000.
**Key Model Details** (see the configuration sketch after this list):
- **Architecture:** GPT-like Transformer
- **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
- **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
- **Parameters:**
- Embedding Dimension (`n_embd`): 384
- Number of Heads (`n_head`): 6
- Number of Layers (`n_layer`): 6
- Block Size (`block_size`): 256
- Dropout: 0.2
- **Training Hyperparameters:** [Specify hyperparameters like learning rate, batch size, optimizer, iterations etc. **TODO: Fill in details**]
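For reference, the architecture hyperparameters above can be gathered into a small configuration object. This is a minimal illustrative sketch; the `DarijaLMConfig` class below is not part of the repository, and only the values are taken from the list above.
```python
from dataclasses import dataclass

@dataclass
class DarijaLMConfig:
    """Illustrative container for the hyperparameters listed above (not an official class)."""
    vocab_size: int = 32000   # SentencePiece BPE vocabulary
    n_embd: int = 384         # embedding dimension
    n_head: int = 6           # attention heads per layer
    n_layer: int = 6          # Transformer blocks
    block_size: int = 256     # maximum context length in tokens
    dropout: float = 0.2

config = DarijaLMConfig()
assert config.n_embd % config.n_head == 0  # each head operates on 384 // 6 = 64 dimensions
```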
## Intended Uses & Limitations
This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and further fine-tuning on downstream applications.
**Limitations:**
- **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
- **Data Bias:** As it is trained on Wikipedia data, the model may exhibit biases present in the dataset.
- **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
- **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.
## How to Use
You can load and use this model with the `transformers` library from Hugging Face. Make sure you have `transformers`, `sentencepiece`, and `torch` installed.
```python
# pip install transformers sentencepiece torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import sentencepiece  # noqa: F401  (ensures the SentencePiece backend is available)

model_name = "Duino/Darija-LM"  # or a local path to the saved model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model and inputs to the same device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Example generation:
input_text = "مرحبا بالعالم"  # example Arabic/Darija input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate text (adjust sampling parameters as needed)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, the SentencePiece model is handled automatically, provided it is correctly configured in the repository.
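If the `transformers` tokenizer classes cannot resolve the tokenizer automatically, the SentencePiece model can also be loaded directly with the `sentencepiece` package. This is a fallback sketch; the filename `tokenizer.model` is an assumption, so check the repository file list for the actual SentencePiece model file.
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the SentencePiece model file from the repo.
# The filename "tokenizer.model" is an assumption; adjust to the actual file name.
sp_path = hf_hub_download(repo_id="Duino/Darija-LM", filename="tokenizer.model")

sp = spm.SentencePieceProcessor(model_file=sp_path)
ids = sp.encode("مرحبا بالعالم", out_type=int)     # token ids
pieces = sp.encode("مرحبا بالعالم", out_type=str)  # subword pieces
print(ids, pieces)
print(sp.decode(ids))
```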
## Training Details
The model was trained using the following steps:
1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed (see the sketch after this list).
2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.
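The data pipeline described in steps 1, 2, and 4 could look roughly like the sketch below. This is not the actual training script: the dataset configuration names (`20231101.ar` and `20231101.ary` under `wikimedia/wikipedia`), the sample sizes, and the file paths are assumptions for illustration.
```python
import numpy as np
import sentencepiece as spm
from datasets import load_dataset

# 1. Stream the Wikipedia snapshots (config names are assumptions; adjust to the snapshot used).
ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train", streaming=True)

# 2. Train a BPE SentencePiece model on a sample of the Arabic text.
with open("sample.txt", "w", encoding="utf-8") as f:
    for i, article in enumerate(ar):
        f.write(article["text"] + "\n")
        if i >= 10_000:  # sample size is illustrative
            break
spm.SentencePieceTrainer.train(
    input="sample.txt", model_prefix="darija_sp", vocab_size=32000, model_type="bpe"
)

# 3. Tokenize and write token ids to a memory-mapped file for efficient training.
sp = spm.SentencePieceProcessor(model_file="darija_sp.model")
ids = sp.encode(" ".join(a["text"] for _, a in zip(range(1_000), ary)), out_type=int)
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(len(ids),))
arr[:] = np.array(ids, dtype=np.uint16)
arr.flush()
```
Storing token ids as `uint16` is sufficient here because the 32,000-entry vocabulary fits within 16 bits.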
**[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time etc.]**
## Evaluation
**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
- [Metrics and results on a validation set or benchmark.]
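As a starting point for the perplexity suggestion above, a held-out split could be scored roughly as follows. This is an illustrative sketch (the `perplexity` helper below is not part of the repository) and assumes the model loads via `AutoModelForCausalLM` as shown in the usage example.
```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cpu", max_length=256):
    """Rough perplexity estimate over a list of held-out texts (illustrative helper)."""
    model = model.to(device).eval()
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).input_ids.to(device)
        if ids.size(1) < 2:
            continue
        out = model(ids, labels=ids)  # transformers shifts labels internally
        n = ids.size(1) - 1           # number of predicted tokens
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / max(total_tokens, 1))

# Example usage: perplexity(model, tokenizer, validation_texts, device=device)
```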
## Citation
**[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**
## Model Card Contact
**[TODO: Add your contact information so people can reach out with questions or feedback.]**
- [Your name/organization]
- [Your email/website/Hugging Face profile]