---
language:
- ar
- ary
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
model-index:
- name: Darija-LM
results: []
---
# Darija-LM
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.
## Model Description
[**TODO: Add a detailed description of your model here.**]
For example, you can include:
- Model architecture: GPT-like Transformer
- Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocab size: 32000), illustrated by the sketch after this list
- Training parameters: [Specify hyperparameters like learning rate, batch size, layers, heads, etc.]
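
The tokenizer described above can be reproduced with the `sentencepiece` library. The snippet below is a minimal sketch, not the exact training command: the corpus file name, model prefix, and `character_coverage` value are illustrative assumptions; only the BPE model type and the 32000 vocabulary size come from the description above.

```python
import sentencepiece as spm

# Hypothetical tokenizer-training sketch: corpus.txt is assumed to hold the raw
# Arabic + Darija Wikipedia text, one article or sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="darija_lm_sp",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # broad coverage helps with Arabic script
)

sp = spm.SentencePieceProcessor()
sp.load("darija_lm_sp.model")
print(sp.encode("مرحبا بالعالم", out_type=str))  # inspect the resulting subword pieces
```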
## Intended Uses & Limitations
[**TODO: Describe the intended uses and limitations of this model.**]
For example:
- Intended use cases: Text generation, research in multilingual NLP, exploring low-resource language models.
- Potential limitations: may not be suitable for production use without further evaluation and fine-tuning; may reflect biases present in the Wikipedia training data.
## How to Use
The model and tokenizer can be loaded with the `transformers` library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or a local path to the saved model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example generation (adapt the prompt and decoding parameters as needed).
input_text = "مرحبا بالعالم"  # example Arabic/Darija input ("Hello, world")
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
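
Alternatively, generation can go through the `pipeline` helper. This is standard `transformers` usage, assuming the repository hosts the weights and tokenizer in a `transformers`-compatible layout:

```python
from transformers import pipeline

# Assumes Duino/Darija-LM ships transformers-compatible model and tokenizer files.
generator = pipeline("text-generation", model="Duino/Darija-LM")
print(generator("مرحبا بالعالم", max_length=50)[0]["generated_text"])
```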
## Training Details
[**TODO: Provide details about the training process.**]
- Training data preprocessing: [Describe tokenization, data splitting, etc.] (see the data-loading sketch after this list)
- Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- Hardware: [Specify GPUs or TPUs used]
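
As a starting point for filling in the preprocessing details, the sketch below shows one plausible way to pull the 20231101 snapshots with the `datasets` library. The `wikimedia/wikipedia` dataset name, the `20231101.ar` / `20231101.ary` config names, and the 99/1 split are assumptions, not a record of the actual pipeline.

```python
from datasets import load_dataset

# Hypothetical data-loading step: the 20231101 Arabic and Darija Wikipedia snapshots
# are assumed to come from the wikimedia/wikipedia dataset on the Hugging Face Hub.
arabic = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")
darija = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")

# Simple preprocessing sketch: collect article text and hold out a validation slice.
corpus = [row["text"] for row in arabic] + [row["text"] for row in darija]
split = int(0.99 * len(corpus))
train_texts, val_texts = corpus[:split], corpus[split:]
```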
## Evaluation
[**TODO: Include evaluation metrics if you have them.**]
- [Metrics and results on a validation set or benchmark.]
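
Until benchmark numbers are added, held-out perplexity is a common first metric for a causal language model. The sketch below is illustrative only; the single sample sentence stands in for a real validation set.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical perplexity check on a small held-out sample.
model_name = "Duino/Darija-LM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

val_texts = ["مرحبا بالعالم"]  # replace with real validation sentences
losses = []
with torch.no_grad():
    for text in val_texts:
        enc = tokenizer(text, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

print("perplexity:", math.exp(sum(losses) / len(losses)))
```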
## Citation
[**TODO: Add citation information if applicable.**]
## Model Card Contact
[**TODO: Add your contact information.**]
- [Your name/organization]
- [Your email/website/Hugging Face profile]