---
language_model:
- causal
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
model-index:
- name: Darija-LM
  results: []
---

# Darija-LM

This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.

## Model Description

- Model architecture: GPT-like (decoder-only) Transformer for causal language modeling
- Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocabulary size 32,000); a tokenizer-training sketch follows this list
- Training hyperparameters: [Specify learning rate, batch size, number of layers, attention heads, etc.]
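
Since the tokenizer is a SentencePiece BPE model with a 32,000-token vocabulary, it could have been produced roughly as follows (a minimal sketch; the corpus file name and all options other than the vocabulary size and model type are assumptions, not confirmed training settings):

```python
import sentencepiece as spm

# Train a BPE SentencePiece model on the raw text corpus.
# "corpus.txt" is a hypothetical plain-text export of the Arabic + Darija
# Wikipedia articles, one article (or line) per row.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="darija_lm_sp",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,  # keep full coverage of the Arabic script
)

# Load the trained model and sanity-check the segmentation.
sp = spm.SentencePieceProcessor(model_file="darija_lm_sp.model")
print(sp.encode("مرحبا بالعالم", out_type=str))
```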

## Intended Uses & Limitations

- Intended use cases: text generation, research in multilingual NLP, and exploring language models for low-resource languages such as Darija.
- Limitations: the model has not been evaluated for production use and may need further evaluation and fine-tuning; it may also reflect biases present in the Wikipedia training data.

## How to Use

Load the model and tokenizer with the Hugging Face `transformers` library:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or the path to a locally saved checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example generation (adapt the parameters to your needs)
input_text = "مرحبا بالعالم"  # example Arabic/Darija input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
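
For quick experiments, the same checkpoint can also be used through the `pipeline` API (a minimal sketch, assuming the hosted repository is compatible with `AutoModelForCausalLM` as in the snippet above):

```python
from transformers import pipeline

# Text-generation pipeline; device=-1 runs on CPU, device=0 on the first GPU.
generator = pipeline("text-generation", model="Duino/Darija-LM", device=-1)
result = generator("مرحبا بالعالم", max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])
```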

## Training Details

[**TODO: Provide details about the training process.**]
- Data preprocessing: [Describe tokenization, data splitting, etc.]; see the data-loading sketch after this list.
- Training procedure: [Optimizer, learning-rate schedule, number of iterations, etc.]
- Hardware: [GPUs or TPUs used]
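
The training corpus described above (Arabic and Darija Wikipedia, 20231101 snapshot) can be reproduced from the public dumps along these lines (a sketch; the `wikimedia/wikipedia` configuration names `20231101.ar` and `20231101.ary` are assumptions about the exact dumps used):

```python
from datasets import load_dataset

# Load the Arabic and Moroccan Arabic (Darija) Wikipedia dumps.
arabic = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")
darija = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")

print(len(arabic), len(darija))
print(arabic[0]["text"][:200])  # each record has "id", "url", "title", "text"
```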

## Evaluation

[**TODO: Include evaluation metrics if you have them.**]
- [Metrics and results on a validation set or benchmark.] A simple perplexity check is sketched below.
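
A minimal held-out perplexity check could look like this (illustrative only: it reuses `model`, `tokenizer`, and `device` from the usage snippet above, and `eval_texts` is a hypothetical list of validation strings, not an official benchmark):

```python
import math
import torch

eval_texts = ["مثال للتقييم", "جملة أخرى للاختبار"]  # hypothetical validation strings

model.eval()
losses = []
with torch.no_grad():
    for text in eval_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        # For causal LMs, passing labels=input_ids returns the mean
        # next-token cross-entropy over the sequence.
        losses.append(model(ids, labels=ids).loss.item())

print("perplexity:", math.exp(sum(losses) / len(losses)))
```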

## Citation

[**TODO: Add citation information if applicable.**]

## Model Card Contact

[**TODO: Add your contact information.**]
- [Your name/organization]
- [Your email/website/Hugging Face profile]