---
language:
- ar
- ary
pipeline_tag: text-generation
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
model-index:
- name: Darija-LM
results: []
---
# Darija-LM
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.
## Model Description
[**TODO: Add a detailed description of your model here.**]
For example, you can include:
- Model architecture: GPT-like Transformer
- Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocab size: 32,000); see the training sketch after this list
- Training parameters: [Specify hyperparameters like learning rate, batch size, layers, heads, etc.]
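As a rough illustration of the tokenizer setup described above, here is a minimal sketch of training a SentencePiece BPE model with a 32,000-token vocabulary. The corpus file `corpus_ar_ary.txt`, the model prefix `darija_sp`, and the `character_coverage` setting are hypothetical placeholders, not values taken from this repository.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a combined Arabic/Darija corpus.
# `corpus_ar_ary.txt` is a hypothetical one-document-per-line text file.
spm.SentencePieceTrainer.train(
    input="corpus_ar_ary.txt",
    model_prefix="darija_sp",   # writes darija_sp.model / darija_sp.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # assumed setting; common for non-Latin scripts
)

# Load the trained tokenizer and inspect a sample encoding.
sp = spm.SentencePieceProcessor(model_file="darija_sp.model")
print(sp.encode("مرحبا بالعالم", out_type=str))
```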
## Intended Uses & Limitations
[**TODO: Describe the intended uses and limitations of this model.**]
For example:
- Intended use cases: Text generation, research in multilingual NLP, exploring low-resource language models.
- Potential limitations: Not suitable for production use without further evaluation and fine-tuning; may inherit biases present in the Wikipedia training data.
## How to Use
Load the model and tokenizer with the `transformers` library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or a local path to the saved model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example generation (adapt the decoding parameters as needed).
input_text = "مرحبا بالعالم"  # example Arabic/Darija input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=50, num_beams=5,
                        no_repeat_ngram_size=2, early_stopping=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
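Equivalently, the `transformers` pipeline API wraps the same load-and-generate steps in a single call, assuming the Hub checkpoint includes the tokenizer files:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Duino/Darija-LM")
print(generator("مرحبا بالعالم", max_length=50)[0]["generated_text"])
```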
## Training Details
[**TODO: Provide details about the training process.**]
- Training data preprocessing: [Describe tokenization, data splitting, etc.]; a data-loading sketch follows this list
- Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- Hardware: [Specify GPUs or TPUs used]
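For reference, the 20231101 Wikipedia snapshots mentioned above can be pulled with the `datasets` library. This sketch assumes the `wikimedia/wikipedia` dataset configs for Arabic (`ar`) and Moroccan Arabic (`ary`); the shuffle seed is an arbitrary placeholder.

```python
from datasets import load_dataset, concatenate_datasets

# 20231101 snapshots of Arabic and Darija (Moroccan Arabic) Wikipedia.
wiki_ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")
wiki_ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")

# Combine and shuffle before tokenization and train/validation splitting.
corpus = concatenate_datasets([wiki_ar, wiki_ary]).shuffle(seed=42)
print(len(corpus), corpus[0]["text"][:100])
```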
## Evaluation
[**TODO: Include evaluation metrics if you have them.**]
- [Metrics and results on a validation set or benchmark; a perplexity sketch follows.]
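One simple metric to report here is validation perplexity. Below is a minimal sketch, assuming `model` and `tokenizer` are loaded as in the usage example above; the helper function is hypothetical, not part of this repository.

```python
import torch

def perplexity(model, tokenizer, text, device="cpu"):
    # Perplexity = exp(mean next-token cross-entropy loss).
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Example: report the average over a held-out set of documents.
# print(perplexity(model, tokenizer, "مرحبا بالعالم", device))
```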
## Citation
[**TODO: Add citation information if applicable.**]
## Model Card Contact
[**TODO: Add your contact information.**]
- [Your name/organization]
- [Your email/website/Hugging Face profile]