Duino committed · Commit 9c9bdd1 · verified · 1 Parent(s): 2b58741

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +53 -24
  2. config.yaml +1 -0
README.md CHANGED
@@ -19,56 +19,85 @@ This is a multilingual language model trained on Arabic and Darija (Moroccan Ara
 
 ## Model Description
 
- [**TODO: Add a detailed description of your model here.**]
- For example, you can include:
- - Model architecture: GPT-like Transformer
- - Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- - Tokenizer: SentencePiece (BPE, vocab size: 32000)
- - Training parameters: [Specify hyperparameters like learning rate, batch size, layers, heads, etc.]
+ This model is a causal language model based on a GPT-like Transformer architecture. It is trained on a combination of Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot. The model uses SentencePiece for tokenization with a BPE algorithm and a vocabulary size of 32000.
+
+ **Key Model Details:**
+ - **Architecture:** GPT-like Transformer
+ - **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
+ - **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
+ - **Parameters:**
+   - Embedding Dimension (`n_embd`): 384
+   - Number of Heads (`n_head`): 6
+   - Number of Layers (`n_layer`): 6
+   - Block Size (`block_size`): 256
+   - Dropout: 0.2
+ - **Training Hyperparameters:** [Specify hyperparameters like learning rate, batch size, optimizer, iterations, etc. **TODO: Fill in details**]
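For orientation only, here is a minimal sketch of how the hyperparameters listed above could be expressed as a standard `transformers` `GPT2Config` (the repository's `config.yaml` declares `model_type: gpt2`). The mapping of `block_size` to `n_positions` and of the single dropout value to GPT-2's three dropout fields is an assumption, not something taken from the training code.

```python
# Illustrative sketch only: the numbers come from the model card above; the
# GPT2Config field mapping (block_size -> n_positions, one dropout value ->
# resid/embd/attn dropout) is an assumption, not the author's training setup.
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=32000,   # SentencePiece BPE vocabulary size
    n_positions=256,    # block_size
    n_embd=384,         # embedding dimension
    n_layer=6,          # number of Transformer blocks
    n_head=6,           # attention heads per block
    resid_pdrop=0.2,
    embd_pdrop=0.2,
    attn_pdrop=0.2,
)
print(config)
```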
 
 ## Intended Uses & Limitations
 
- [**TODO: Describe the intended uses and limitations of this model.**]
- For example:
- - Intended use cases: Text generation, research in multilingual NLP, exploring low-resource language models.
- - Potential limitations: May not be suitable for production environments without further evaluation and fine-tuning, potential biases from Wikipedia data.
+ This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and for further fine-tuning on downstream applications.
+
+ **Limitations:**
+ - **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
+ - **Data Bias:** As it is trained on Wikipedia data, the model may exhibit biases present in the dataset.
+ - **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
+ - **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.
 
 ## How to Use
 
- [**TODO: Add instructions on how to load and use the model.**]
+ You can load and use this model with the `transformers` library from Hugging Face. Make sure you have `transformers`, `torch`, and `sentencepiece` installed.
+
 ```python
+ import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
+ import sentencepiece as spm # Ensure sentencepiece is installed
 
 model_name = "Duino/Darija-LM" # or path to your saved model locally
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
 
- # Example generation code (adapt as needed based on your model and tokenizer)
- # input_text = "مرحبا بالعالم" # Example Arabic/Darija input
- # input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
- # output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
- # generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
- # print(generated_text)
+ # Example generation code:
+ input_text = "مرحبا بالعالم"  # example Arabic/Darija input ("Hello, world")
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)  # keep model and inputs on the same device
+ input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
+
+ # Generate text (adjust parameters as needed)
+ output = model.generate(
+     input_ids,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+     top_k=50,
+     repetition_penalty=1.1
+ )
+
+ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
+ print(generated_text)
 ```
 
+ **Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, it should automatically handle the SentencePiece model if it is correctly configured in the repository.
+
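A minimal round-trip check of the tokenizer behaviour described in the note above, assuming the `tokenizer` object loaded in the previous snippet:

```python
# Minimal round-trip check; assumes `tokenizer` from the snippet above.
# The exact token ids depend on the trained SentencePiece vocabulary.
ids = tokenizer.encode("مرحبا بالعالم")  # "Hello, world"
print(ids)
print(tokenizer.decode(ids))  # should reproduce the input, up to whitespace normalization
```
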
 ## Training Details
 
- [**TODO: Provide details about the training process.**]
- - Training data preprocessing: [Describe tokenization, data splitting, etc.]
- - Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- - Hardware: [Specify GPUs or TPUs used]
+ The model was trained using the following steps:
+ 1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
+ 2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
+ 3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
+ 4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
+ 5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.
+
+ **[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time, etc.]**
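To make steps 1 and 2 concrete, here is a minimal sketch of streaming the snapshots and training a 32k BPE SentencePiece model. The dataset identifier (`wikimedia/wikipedia` with the `20231101.ar` and `20231101.ary` configs), the sample size, and the file names are assumptions, not details taken from the actual training script.

```python
# Illustrative sketch of steps 1-2; dataset ids, sample size and file names are
# assumptions, not taken from the author's training script.
import sentencepiece as spm
from datasets import load_dataset

# Stream the two Wikipedia snapshots instead of downloading them in full.
ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train", streaming=True)  # Darija stream, consumed the same way for model training

# Dump a sample of Arabic article text to disk for SentencePiece training.
with open("sp_sample.txt", "w", encoding="utf-8") as f:
    for i, article in enumerate(ar):
        f.write(article["text"].replace("\n", " ") + "\n")
        if i >= 20000:  # sample size is arbitrary
            break

# Train a BPE SentencePiece model with the 32000-token vocabulary noted above.
spm.SentencePieceTrainer.train(
    input="sp_sample.txt",
    model_prefix="darija_sp",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)
```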
 
 ## Evaluation
 
- [**TODO: Include evaluation metrics if you have them.**]
+ **[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
 - [Metrics and results on a validation set or benchmark.]
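A minimal sketch of the perplexity check suggested above, assuming the `model`, `tokenizer`, and `torch` imports from the How to Use section, with an arbitrary sentence standing in for real held-out validation text:

```python
# Rough perplexity sketch; assumes `model`, `tokenizer` and `torch` from the
# usage snippet above. Replace the sample string with real held-out text.
import math

model.eval()
sample = "الرباط هي عاصمة المغرب."  # arbitrary example sentence ("Rabat is the capital of Morocco.")
enc = tokenizer(sample, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])
print("perplexity:", math.exp(out.loss.item()))
```
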
 
 ## Citation
 
- [**TODO: Add citation information if applicable.**]
+ **[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**
 
 ## Model Card Contact
 
- [**TODO: Add your contact information.**]
+ **[TODO: Add your contact information so people can reach out with questions or feedback.]**
 - [Your name/organization]
 - [Your email/website/Hugging Face profile]
config.yaml CHANGED
@@ -3,6 +3,7 @@ architectures:
 - GPTLanguageModel
 block_size: 256
 dropout: 0.2
+ model_type: gpt2
 n_embd: 384
 n_head: 6
 n_layer: 6
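
The added `model_type: gpt2` key can be read back with any YAML parser. A minimal sketch, assuming PyYAML is installed and `config.yaml` has been downloaded locally:

```python
# Minimal sketch: inspect config.yaml as it looks after this commit.
# Assumes PyYAML is installed and the file is available locally.
import yaml

with open("config.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["model_type"])                             # gpt2
print(cfg["architectures"])                          # ['GPTLanguageModel']
print(cfg["n_embd"], cfg["n_head"], cfg["n_layer"])  # 384 6 6
print(cfg["block_size"], cfg["dropout"])             # 256 0.2
```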