---
language:
- ar
- ary
pipeline_tag: text-generation
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
model-index:
- name: Darija-LM
results: []
---
# Darija-LM
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.
## Model Description
[**TODO: Add a detailed description of your model here.**]
For example, you can include:
- Model architecture: GPT-like Transformer
- Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocab size: 32,000); see the training sketch after this list
- Training parameters: [Specify hyperparameters like learning rate, batch size, layers, heads, etc.]
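As a rough illustration of the tokenizer setup described above, here is a minimal sketch of training a SentencePiece BPE model with a 32,000-token vocabulary. The corpus file `corpus_ar_ary.txt`, the model prefix `darija_sp`, and the `character_coverage` setting are hypothetical placeholders, not values taken from this repository.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a combined Arabic/Darija corpus.
# `corpus_ar_ary.txt` is a hypothetical one-document-per-line text file.
spm.SentencePieceTrainer.train(
    input="corpus_ar_ary.txt",
    model_prefix="darija_sp",   # writes darija_sp.model / darija_sp.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # assumed setting; common for non-Latin scripts
)

# Load the trained tokenizer and inspect a sample encoding.
sp = spm.SentencePieceProcessor(model_file="darija_sp.model")
print(sp.encode("مرحبا بالعالم", out_type=str))
```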
## Intended Uses & Limitations
[**TODO: Describe the intended uses and limitations of this model.**]
For example:
- Intended use cases: Text generation, research in multilingual NLP, exploring low-resource language models.
- Potential limitations: Not suitable for production use without further evaluation and fine-tuning; may inherit biases present in the Wikipedia training data.
## How to Use
Load the model and tokenizer with the `transformers` library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or a local path to the saved model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example generation (adapt the decoding parameters as needed).
input_text = "مرحبا بالعالم"  # example Arabic/Darija input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=50, num_beams=5,
                        no_repeat_ngram_size=2, early_stopping=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
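Equivalently, the `transformers` pipeline API wraps the same load-and-generate steps in a single call, assuming the Hub checkpoint includes the tokenizer files:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Duino/Darija-LM")
print(generator("مرحبا بالعالم", max_length=50)[0]["generated_text"])
```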
## Training Details
[**TODO: Provide details about the training process.**]
- Training data preprocessing: [Describe tokenization, data splitting, etc.]; a data-loading sketch follows this list
- Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- Hardware: [Specify GPUs or TPUs used]
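For reference, the 20231101 Wikipedia snapshots mentioned above can be pulled with the `datasets` library. This sketch assumes the `wikimedia/wikipedia` dataset configs for Arabic (`ar`) and Moroccan Arabic (`ary`); the shuffle seed is an arbitrary placeholder.

```python
from datasets import load_dataset, concatenate_datasets

# 20231101 snapshots of Arabic and Darija (Moroccan Arabic) Wikipedia.
wiki_ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")
wiki_ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")

# Combine and shuffle before tokenization and train/validation splitting.
corpus = concatenate_datasets([wiki_ar, wiki_ary]).shuffle(seed=42)
print(len(corpus), corpus[0]["text"][:100])
```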
## Evaluation
[**TODO: Include evaluation metrics if you have them.**]
- [Metrics and results on a validation set or benchmark; a perplexity sketch follows.]
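One simple metric to report here is validation perplexity. Below is a minimal sketch, assuming `model` and `tokenizer` are loaded as in the usage example above; the helper function is hypothetical, not part of this repository.

```python
import torch

def perplexity(model, tokenizer, text, device="cpu"):
    # Perplexity = exp(mean next-token cross-entropy loss).
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Example: report the average over a held-out set of documents.
# print(perplexity(model, tokenizer, "مرحبا بالعالم", device))
```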
## Citation
[**TODO: Add citation information if applicable.**]
## Model Card Contact
[**TODO: Add your contact information.**]
- [Your name/organization]
- [Your email/website/Hugging Face profile]