---
language_model:
- causal
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
language:
- ar
- ary
model-index:
- name: Darija-LM
  results: []
---

# Darija-LM

This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.

## Model Description

This model is a causal language model based on a GPT-like Transformer architecture. It was trained on a combination of the Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot. Tokenization uses SentencePiece with a BPE algorithm and a vocabulary size of 32000.

**Key Model Details:**

- **Architecture:** GPT-like Transformer
- **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
- **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
- **Parameters:**
  - Embedding Dimension (`n_embd`): 384
  - Number of Heads (`n_head`): 6
  - Number of Layers (`n_layer`): 6
  - Block Size (`block_size`): 256
  - Dropout: 0.2
- **Training Hyperparameters:** [Specify hyperparameters such as learning rate, batch size, optimizer, and number of iterations. **TODO: Fill in details**]

## Intended Uses & Limitations

This model is intended for research purposes, specifically in multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation and for further fine-tuning on downstream applications.

**Limitations:**

- **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
- **Data Bias:** Because it is trained on Wikipedia data, the model may exhibit biases present in that dataset.
- **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
- **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.

## How to Use

You can load and use this model with the Hugging Face `transformers` library. Make sure you have `transformers` and `sentencepiece` installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Duino/Darija-LM"  # or the path to a local copy of the model

# Move both the model and the inputs to the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Example generation:
input_text = "مرحبا بالعالم"  # Example Arabic/Darija input ("Hello, world")
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate text (adjust sampling parameters as needed)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, the SentencePiece model is handled automatically as long as it is correctly configured in the repository.

## Training Details

The model was trained using the following steps:

1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed.
2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently (see the sketch after this list).
5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.

**[TODO: Add more specific details about the training process: optimizer, learning rate schedule, hardware used, training time, etc.]**
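The memory-mapping step above can be implemented with `numpy.memmap`, so the full token stream never has to fit in RAM. Below is a minimal sketch under the assumption that the tokenized corpus is stored as a flat binary file of `uint16` token IDs; the filename `train.bin` and the batch size are illustrative, not the actual training configuration:

```python
import numpy as np
import torch

block_size = 256  # matches the model's block_size
batch_size = 64   # illustrative value, not the actual training setting

# Memory-map the tokenized corpus: the data stays on disk and pages are
# loaded lazily, so corpora larger than RAM can be sampled efficiently.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch():
    # Sample random starting offsets, then slice out contiguous windows.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack(
        [torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix]
    )
    y = torch.stack(
        [torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix]
    )
    return x, y  # inputs and next-token targets, shifted by one position

x, y = get_batch()
print(x.shape, y.shape)  # torch.Size([64, 256]) torch.Size([64, 256])
```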
## Evaluation

**[TODO: Include evaluation metrics if you have them. It is highly recommended to evaluate the model and add metrics here; for example, perplexity on a held-out validation set.]**

- [Metrics and results on a validation set or benchmark.]

## Citation

**[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**

## Model Card Contact

**[TODO: Add contact information so people can reach out with questions or feedback.]**

- [Your name/organization]
- [Your email/website/Hugging Face profile]
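As a starting point for the evaluation TODO above, here is a minimal sketch of computing held-out perplexity with the `transformers` API; the validation texts are placeholders, and this is not an official evaluation setup for this model:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

# Placeholder validation texts; replace with a real held-out split.
texts = ["مرحبا بالعالم"]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        ids = tokenizer.encode(text, return_tensors="pt").to(device)
        # With labels=input_ids, the model returns the mean next-token
        # cross-entropy over the sequence.
        loss = model(ids, labels=ids).loss
        n = ids.numel() - 1  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```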