Duino committed on
Commit b430ca6 · verified · 1 Parent(s): ac684b4

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +5 -5
  2. config.json +1 -1
  3. tokenizer_config.json +1 -1
README.md CHANGED
@@ -52,9 +52,9 @@ You can load and use this model using the `transformers` library from Hugging Fa
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
- import sentencepiece as spm # Ensure sentencepiece is installed
+ import torch
 
- model_name = "Duino/Darija-LM" # or path to your saved model locally
+ model_name = "Duino/Darija-LM" # or path to your saved model locally
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)
 
@@ -77,12 +77,12 @@ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
  print(generated_text)
  ```
 
- **Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository.
+ **Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it's correctly configured in the repository.
 
  ## Training Details
 
  The model was trained using the following steps:
- 1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using `datasets` library and preprocessed.
+ 1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
  2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
  3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
  4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
@@ -92,7 +92,7 @@ The model was trained using the following steps:
 
  ## Evaluation
 
- **[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
+ **[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
  - [Metrics and results on a validation set or benchmark.]
 
  ## Citation
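The Evaluation TODO in the README diff suggests reporting perplexity on a held-out validation set. As a rough sketch of that calculation only: perplexity is the exponential of the mean negative log-likelihood per token. The helper name `perplexity` and the input format (a flat list of natural-log probabilities, one per target token) are assumptions for illustration; nothing in this commit shows how the model's log-probabilities would actually be collected.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities (hypothetical helper).

    token_log_probs: log p(token | context) for every target token in the
    validation set, gathered however the evaluation loop produces them.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token, then exponentiate.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that assigns probability 0.25 to every target token
# behaves like a uniform guess over 4 options, so the result is close to 4.
uniform_guess = [math.log(0.25)] * 8
print(perplexity(uniform_guess))
```

Lower is better, and the value is only comparable between models that share the same tokenizer, since it is measured per token.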
config.json CHANGED
@@ -8,7 +8,7 @@
  "n_layer": 6,
  "block_size": 256,
  "dropout": 0.2,
- "tokenizer_class": "SentencePieceTokenizerFast",
+ "tokenizer_class": "PreTrainedTokenizerFast",
  "tokenizer_file": "spm_model.model",
  "_name_or_path": "Duino/Darija-LM",
  "model_type": "gpt2"
tokenizer_config.json CHANGED
@@ -1,4 +1,4 @@
  {
- "tokenizer_class": "SentencePieceTokenizerFast",
+ "tokenizer_class": "PreTrainedTokenizerFast",
  "model_file": "spm_model.model"
  }
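
This commit changes `tokenizer_class` in both `config.json` and `tokenizer_config.json`, so the two files need to stay in agreement. The sketch below is one way to check that with the standard library only. The helper name `tokenizer_settings_mismatches` is hypothetical, and the JSON fragments contain only the keys visible in this diff; the real files may hold more. It is also an assumption that `tokenizer_file` in one file and `model_file` in the other are meant to name the same SentencePiece model.

```python
import json

# Fragments copied from the new side of this diff; all other keys are omitted.
config = json.loads(
    '{"tokenizer_class": "PreTrainedTokenizerFast",'
    ' "tokenizer_file": "spm_model.model"}'
)
tokenizer_config = json.loads(
    '{"tokenizer_class": "PreTrainedTokenizerFast",'
    ' "model_file": "spm_model.model"}'
)


def tokenizer_settings_mismatches(config, tokenizer_config):
    """List disagreements between the two files (hypothetical helper)."""
    problems = []
    if config.get("tokenizer_class") != tokenizer_config.get("tokenizer_class"):
        problems.append("tokenizer_class differs between the two files")
    # Assumption: both keys are intended to point at the same model file.
    if config.get("tokenizer_file") != tokenizer_config.get("model_file"):
        problems.append("tokenizer_file and model_file name different files")
    return problems


print(tokenizer_settings_mismatches(config, tokenizer_config))  # → []
```

An empty list only shows that the files agree with each other. As far as I know, `PreTrainedTokenizerFast` is normally backed by a `tokenizer.json` produced by the `tokenizers` library rather than a raw SentencePiece `.model` file, so it is worth confirming that `AutoTokenizer.from_pretrained("Duino/Darija-LM")` actually loads after this change.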