Upload folder using huggingface_hub
- README.md +5 -5
- config.json +1 -1
- tokenizer_config.json +1 -1
README.md CHANGED
@@ -52,9 +52,9 @@ You can load and use this model using the `transformers` library from Hugging Face:
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
-import
+import torch
 
-model_name = "Duino/Darija-LM"
+model_name = "Duino/Darija-LM" # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 
@@ -77,12 +77,12 @@ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
 print(generated_text)
 ```
 
-**Note on Tokenizer:** This model uses a SentencePiece tokenizer.
+**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository.
 
 ## Training Details
 
 The model was trained using the following steps:
-1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using `datasets` library and preprocessed.
+1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
 2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
 3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
 4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
@@ -92,7 +92,7 @@ The model was trained using the following steps:
 
 ## Evaluation
 
-**[TODO: Include evaluation metrics if you have them.
+**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
 - [Metrics and results on a validation set or benchmark.]
 
 ## Citation
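The README hunks above skip the lines between the model load and `print(generated_text)`, where `output` is produced. A minimal sketch of how that elided usage section likely fits together, assuming a plain `model.generate` call; the prompt text and generation parameters below are illustrative, not taken from the repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or path to your saved model locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative Darija prompt; not taken from the repository README.
prompt = "كيف داير؟"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Assumed generation settings; the README's actual parameters are not shown in this diff.
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```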
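The Training Details list in the README describes the pipeline at a high level, but the training script itself is not part of this commit. A rough sketch of steps 1 and 2 (data streaming and SentencePiece training) under assumed settings; the dataset name and config, sample size, vocabulary size, and model type below are guesses, with only the `spm_model` file prefix taken from the configs in this commit:

```python
import sentencepiece as spm
from datasets import load_dataset

# Step 1: stream Wikipedia articles instead of downloading the full dump.
# The dataset name and config ("wikimedia/wikipedia", "20231101.ar") are assumptions.
ar_wiki = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)

# Write a sample of streamed articles to a plain-text corpus file (sample size is assumed).
with open("wiki_ar_sample.txt", "w", encoding="utf-8") as corpus:
    for i, article in enumerate(ar_wiki):
        corpus.write(article["text"].replace("\n", " ") + "\n")
        if i >= 10_000:
            break

# Step 2: train a SentencePiece model on the sample.
# Only the "spm_model" prefix matches the configs in this commit; vocab_size,
# model_type, and character_coverage are assumed values.
spm.SentencePieceTrainer.train(
    input="wiki_ar_sample.txt",
    model_prefix="spm_model",  # writes spm_model.model and spm_model.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)
```

The resulting `spm_model.model` is the file referenced by `tokenizer_file` and `model_file` in the config changes below.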
config.json CHANGED
@@ -8,7 +8,7 @@
 "n_layer": 6,
 "block_size": 256,
 "dropout": 0.2,
-"tokenizer_class": "
+"tokenizer_class": "PreTrainedTokenizerFast",
 "tokenizer_file": "spm_model.model",
 "_name_or_path": "Duino/Darija-LM",
 "model_type": "gpt2"
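For orientation, the hyperparameters visible in this config.json hunk map loosely onto a standard GPT-2 style configuration. A sketch of that approximate correspondence, assuming `block_size` is the context length and the single `dropout` value covers all dropout sites; the embedding size and head count are not shown in the hunk, so those values are placeholders:

```python
from transformers import GPT2Config

# Approximate mapping of the repo's config fields onto GPT2Config.
# n_embd and n_head do not appear in this hunk; the values here are placeholders.
config = GPT2Config(
    n_layer=6,        # "n_layer": 6
    n_positions=256,  # "block_size": 256, assumed to be the context length
    resid_pdrop=0.2,  # "dropout": 0.2, assumed to cover all dropout sites
    embd_pdrop=0.2,
    attn_pdrop=0.2,
    n_embd=384,       # placeholder, not from the config
    n_head=6,         # placeholder, not from the config
)
print(config)
```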
tokenizer_config.json CHANGED
@@ -1,4 +1,4 @@
 {
-"tokenizer_class": "
+"tokenizer_class": "PreTrainedTokenizerFast",
 "model_file": "spm_model.model"
 }
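Both config changes point `tokenizer_class` at `PreTrainedTokenizerFast` while the repository ships a raw SentencePiece file, `spm_model.model`. Whether `AutoTokenizer.from_pretrained("Duino/Darija-LM")` resolves this cleanly depends on how the repository is laid out; as a fallback, the SentencePiece model can be loaded directly. A sketch assuming the `sentencepiece` and `huggingface_hub` packages; the example text is illustrative:

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Fetch the raw SentencePiece model referenced by config.json / tokenizer_config.json.
spm_path = hf_hub_download(repo_id="Duino/Darija-LM", filename="spm_model.model")

# Load it directly, bypassing transformers' tokenizer resolution.
sp = spm.SentencePieceProcessor(model_file=spm_path)

text = "مرحبا"  # illustrative text
ids = sp.encode(text, out_type=int)
print(ids)
print(sp.decode(ids))
```

If the fast-tokenizer route fails, comparing these ids against `AutoTokenizer` output is a quick way to check that the repository's tokenizer configuration actually picks up the SentencePiece model.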