Upload folder using huggingface_hub
- README.md +53 -24
- config.yaml +1 -0
README.md
CHANGED
@@ -19,56 +19,85 @@ This is a multilingual language model trained on Arabic and Darija (Moroccan Ara
 
 ## Model Description
 
+This model is a causal language model based on a GPT-like Transformer architecture. It is trained on a combination of Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot. The model uses SentencePiece for tokenization with a BPE algorithm and a vocabulary size of 32000.
+
+**Key Model Details:**
+- **Architecture:** GPT-like Transformer
+- **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
+- **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
+- **Parameters:**
+  - Embedding Dimension (`n_embd`): 384
+  - Number of Heads (`n_head`): 6
+  - Number of Layers (`n_layer`): 6
+  - Block Size (`block_size`): 256
+  - Dropout: 0.2
+- **Training Hyperparameters:** [Specify hyperparameters such as learning rate, batch size, optimizer, and iteration count. **TODO: Fill in details**]
 
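For orientation, the hyperparameters added above can be expressed as a `transformers` configuration. The sketch below is an assumption rather than the repository's actual loading code: it maps the model card's values onto `GPT2Config` (plausible given `model_type: gpt2` in `config.yaml`), applying the single dropout value uniformly to the embedding, residual, and attention dropout fields.

```python
from transformers import GPT2Config

# Assumed mapping of the model card's hyperparameters onto a GPT-2 style config;
# the checkpoint's real config may differ.
config = GPT2Config(
    vocab_size=32000,   # SentencePiece BPE vocabulary size
    n_positions=256,    # block_size (maximum context length)
    n_embd=384,         # embedding dimension
    n_layer=6,          # number of Transformer layers
    n_head=6,           # number of attention heads
    embd_pdrop=0.2,     # dropout, applied uniformly here
    resid_pdrop=0.2,
    attn_pdrop=0.2,
)
print(config)
```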
 ## Intended Uses & Limitations
 
+This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and for further fine-tuning on downstream applications.
+
+**Limitations:**
+- **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
+- **Data Bias:** Because it is trained on Wikipedia data, the model may exhibit biases present in that dataset.
+- **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
+- **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.
 
 ## How to Use
 
+You can load and use this model with the `transformers` library from Hugging Face. Make sure you have `transformers` and `sentencepiece` installed.
+
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
+import sentencepiece as spm  # ensure sentencepiece is installed
+import torch
 
 model_name = "Duino/Darija-LM"  # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model.to(device)
 
-# Example generation code
+# Example generation code:
+input_text = "مرحبا بالعالم"  # example Arabic/Darija input ("Hello, world")
+input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
+
+# Generate text (adjust parameters as needed)
+output = model.generate(
+    input_ids,
+    max_new_tokens=100,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.9,
+    top_k=50,
+    repetition_penalty=1.1,
+)
+
+generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
+print(generated_text)
 ```
 
+**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, the SentencePiece model should be handled automatically, provided it is correctly configured in the repository.
 
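If automatic tokenizer loading ever fails (for example, when working from a raw checkpoint directory), the SentencePiece model can also be driven directly with the `sentencepiece` package. A small sketch, assuming the SentencePiece file is shipped as `tokenizer.model` (the filename is an assumption, not confirmed by the card):

```python
import sentencepiece as spm

# "tokenizer.model" is a hypothetical filename; check the repository for the
# actual SentencePiece model file.
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

ids = sp.encode("مرحبا بالعالم", out_type=int)  # text -> token ids ("Hello, world")
print(ids)
print(sp.decode(ids))                           # token ids -> text
```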
 ## Training Details
 
+The model was trained using the following steps:
+
+1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed with the `datasets` library and preprocessed.
+2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
+3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
+4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
+5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.
+
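The streaming and tokenizer-training steps above might look roughly like the following. This is a sketch under assumptions, not the actual training script: the `wikimedia/wikipedia` configs `20231101.ar` (Arabic) and `20231101.ary` (Darija), the 10,000-article sample, and the output file names are illustrative guesses.

```python
import itertools

import sentencepiece as spm
from datasets import load_dataset

# Stream the two Wikipedia dumps instead of downloading them in full
# (dataset/config names are assumptions based on the 20231101 snapshot).
wiki_ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
wiki_ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train", streaming=True)  # Darija; streamed the same way for training data

# Dump a sample of the Arabic articles to disk and train a 32k BPE
# SentencePiece model on it, as step 2 describes.
with open("sp_sample.txt", "w", encoding="utf-8") as f:
    for article in itertools.islice(wiki_ar, 10_000):  # sample size is illustrative
        f.write(article["text"] + "\n")

spm.SentencePieceTrainer.train(
    input="sp_sample.txt",
    model_prefix="darija_ar_sp",
    vocab_size=32000,
    model_type="bpe",
)
```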
 
+**[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time, etc.]**
 
 ## Evaluation
 
-[Metrics and results on a validation set or benchmark.]
+**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
 
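As a concrete starting point for the suggested perplexity evaluation, something along these lines could work, reusing `model` and `tokenizer` from the usage snippet above. The held-out file name and the 256-token chunking are placeholders, not part of the original card.

```python
import math

import torch

model.eval()
device = next(model.parameters()).device

# Hypothetical held-out file; substitute your own validation text.
with open("valid.txt", encoding="utf-8") as f:
    enc = tokenizer(f.read(), return_tensors="pt").input_ids.to(device)

block_size, nll_sum, n_tokens = 256, 0.0, 0
with torch.no_grad():
    for i in range(0, enc.size(1), block_size):
        chunk = enc[:, i : i + block_size]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # transformers shifts labels internally
        nll_sum += out.loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```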
 ## Citation
 
+**[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**
 
 ## Model Card Contact
 
-[Your name/organization]
-[Your email/website/Hugging Face profile]
+**[TODO: Add your contact information so people can reach out with questions or feedback.]**
config.yaml
CHANGED
@@ -3,6 +3,7 @@ architectures:
 - GPTLanguageModel
 block_size: 256
 dropout: 0.2
+model_type: gpt2
 n_embd: 384
 n_head: 6
 n_layer: 6