Duino committed · Commit 9c9bdd1 · verified · 1 Parent(s): 2b58741

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +53 -24
  2. config.yaml +1 -0
README.md CHANGED
@@ -19,56 +19,85 @@ This is a multilingual language model trained on Arabic and Darija (Moroccan Ara
 
 ## Model Description
 
- [**TODO: Add a detailed description of your model here.**]
- For example, you can include:
- - Model architecture: GPT-like Transformer
- - Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- - Tokenizer: SentencePiece (BPE, vocab size: 32000)
- - Training parameters: [Specify hyperparameters like learning rate, batch size, layers, heads, etc.]
+ This model is a causal language model based on a GPT-like Transformer architecture. It is trained on a combination of Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot. The model uses SentencePiece for tokenization with a BPE algorithm and a vocabulary size of 32000.
+
+ **Key Model Details:**
+ - **Architecture:** GPT-like Transformer
+ - **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
+ - **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
+ - **Parameters:**
+   - Embedding Dimension (`n_embd`): 384
+   - Number of Heads (`n_head`): 6
+   - Number of Layers (`n_layer`): 6
+   - Block Size (`block_size`): 256
+   - Dropout: 0.2
+ - **Training Hyperparameters:** [Specify hyperparameters like learning rate, batch size, optimizer, iterations, etc. **TODO: Fill in details**]
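For orientation only, here is a minimal sketch of how the hyperparameters listed above could be expressed as a standard `transformers` `GPT2Config` (the repository's `config.yaml` declares `model_type: gpt2`). The mapping of `block_size` to `n_positions` and of the single dropout value to GPT-2's three dropout fields is an assumption, not something taken from the training code.

```python
# Illustrative sketch only: the numbers come from the model card above; the
# GPT2Config field mapping (block_size -> n_positions, one dropout value ->
# resid/embd/attn dropout) is an assumption, not the author's training setup.
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=32000,   # SentencePiece BPE vocabulary size
    n_positions=256,    # block_size
    n_embd=384,         # embedding dimension
    n_layer=6,          # number of Transformer blocks
    n_head=6,           # attention heads per block
    resid_pdrop=0.2,
    embd_pdrop=0.2,
    attn_pdrop=0.2,
)
print(config)
```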
 
 ## Intended Uses & Limitations
 
- [**TODO: Describe the intended uses and limitations of this model.**]
- For example:
- - Intended use cases: Text generation, research in multilingual NLP, exploring low-resource language models.
- - Potential limitations: May not be suitable for production environments without further evaluation and fine-tuning, potential biases from Wikipedia data.
+ This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and for further fine-tuning on downstream applications.
+
+ **Limitations:**
+ - **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
+ - **Data Bias:** As it is trained on Wikipedia data, the model may exhibit biases present in the dataset.
+ - **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
+ - **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.
 
 ## How to Use
 
- [**TODO: Add instructions on how to load and use the model.**]
+ You can load and use this model with the `transformers` library from Hugging Face. Make sure you have `transformers`, `torch`, and `sentencepiece` installed.
+
 ```python
+ import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
+ import sentencepiece as spm # Ensure sentencepiece is installed
 
 model_name = "Duino/Darija-LM" # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 
- # Example generation code (adapt as needed based on your model and tokenizer)
- # input_text = "مرحبا بالعالم" # Example Arabic/Darija input
- # input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
- # output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
- # generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
- # print(generated_text)
+ # Example generation code:
+ input_text = "مرحبا بالعالم"  # example Arabic/Darija input ("Hello, world")
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)  # keep model and inputs on the same device
+ input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
+
+ # Generate text (adjust parameters as needed)
+ output = model.generate(
+     input_ids,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+     top_k=50,
+     repetition_penalty=1.1
+ )
+
+ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
+ print(generated_text)
 ```
 
+ **Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it is correctly configured in the repository.
+
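A minimal round-trip check of the tokenizer behaviour described in the note above, assuming the `tokenizer` object loaded in the previous snippet:

```python
# Minimal round-trip check; assumes `tokenizer` from the snippet above.
# The exact token ids depend on the trained SentencePiece vocabulary.
ids = tokenizer.encode("مرحبا بالعالم")  # "Hello, world"
print(ids)
print(tokenizer.decode(ids))  # should reproduce the input, up to whitespace normalization
```
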
 ## Training Details
 
- [**TODO: Provide details about the training process.**]
- - Training data preprocessing: [Describe tokenization, data splitting, etc.]
- - Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- - Hardware: [Specify GPUs or TPUs used]
+ The model was trained using the following steps:
+ 1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
+ 2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
+ 3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
+ 4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
+ 5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.
+
+ **[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time, etc.]**
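To make steps 1 and 2 concrete, here is a minimal sketch of streaming the snapshots and training a 32k BPE SentencePiece model. The dataset identifier (`wikimedia/wikipedia` with the `20231101.ar` and `20231101.ary` configs), the sample size, and the file names are assumptions, not details taken from the actual training script.

```python
# Illustrative sketch of steps 1-2; dataset ids, sample size and file names are
# assumptions, not taken from the author's training script.
import sentencepiece as spm
from datasets import load_dataset

# Stream the two Wikipedia snapshots instead of downloading them in full.
ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train", streaming=True)  # Darija stream, consumed the same way for model training

# Dump a sample of Arabic article text to disk for SentencePiece training.
with open("sp_sample.txt", "w", encoding="utf-8") as f:
    for i, article in enumerate(ar):
        f.write(article["text"].replace("\n", " ") + "\n")
        if i >= 20000:  # sample size is arbitrary
            break

# Train a BPE SentencePiece model with the 32000-token vocabulary noted above.
spm.SentencePieceTrainer.train(
    input="sp_sample.txt",
    model_prefix="darija_sp",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)
```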
 
 ## Evaluation
 
- [**TODO: Include evaluation metrics if you have them.**]
+ **[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
 - [Metrics and results on a validation set or benchmark.]
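A minimal sketch of the perplexity check suggested above, assuming the `model`, `tokenizer`, and `torch` imports from the How to Use section, with an arbitrary sentence standing in for real held-out validation text:

```python
# Rough perplexity sketch; assumes `model`, `tokenizer` and `torch` from the
# usage snippet above. Replace the sample string with real held-out text.
import math

model.eval()
sample = "الرباط هي عاصمة المغرب."  # arbitrary example sentence ("Rabat is the capital of Morocco.")
enc = tokenizer(sample, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])
print("perplexity:", math.exp(out.loss.item()))
```
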
 
 ## Citation
 
- [**TODO: Add citation information if applicable.**]
+ **[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**
 
 ## Model Card Contact
 
- [**TODO: Add your contact information.**]
+ **[TODO: Add your contact information so people can reach out with questions or feedback.]**
 - [Your name/organization]
 - [Your email/website/Hugging Face profile]
config.yaml CHANGED
@@ -3,6 +3,7 @@ architectures:
 - GPTLanguageModel
 block_size: 256
 dropout: 0.2
+ model_type: gpt2
 n_embd: 384
 n_head: 6
 n_layer: 6
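
The added `model_type: gpt2` key can be read back with any YAML parser. A minimal sketch, assuming PyYAML is installed and `config.yaml` has been downloaded locally:

```python
# Minimal sketch: inspect config.yaml as it looks after this commit.
# Assumes PyYAML is installed and the file is available locally.
import yaml

with open("config.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["model_type"])                             # gpt2
print(cfg["architectures"])                          # ['GPTLanguageModel']
print(cfg["n_embd"], cfg["n_head"], cfg["n_layer"])  # 384 6 6
print(cfg["block_size"], cfg["dropout"])             # 256 0.2
```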