---
language_model:
- causal
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
model-index:
- name: Darija-LM
  results: []
---

# Darija-LM

This is a multilingual language model trained on the Arabic and Darija (Moroccan Arabic) Wikipedia datasets.

## Model Description

[**TODO: Add a detailed description of your model here.**]

For example, you can include:

- Model architecture: GPT-like Transformer
- Training data: Arabic and Darija Wikipedia (20231101 snapshot)
- Tokenizer: SentencePiece (BPE, vocab size: 32000)
- Training parameters: [Specify hyperparameters like learning rate, batch size, layers, heads, etc.]

## Intended Uses & Limitations

[**TODO: Describe the intended uses and limitations of this model.**]

For example:

- Intended use cases: text generation, research in multilingual NLP, exploring low-resource language models.
- Potential limitations: may not be suitable for production environments without further evaluation and fine-tuning; may reflect biases present in the Wikipedia data.

## How to Use

[**TODO: Add instructions on how to load and use the model.**]

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Duino/Darija-LM"  # or the path to a locally saved model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example generation (adapt as needed for your model and tokenizer)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

input_text = "مرحبا بالعالم"  # Example Arabic/Darija input ("Hello, world")
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Details

[**TODO: Provide details about the training process.**]

- Training data preprocessing: [Describe tokenization, data splitting, etc.]
- Training procedure: [Optimizer, learning rate schedule, number of iterations, etc.]
- Hardware: [Specify GPUs or TPUs used]

## Evaluation

[**TODO: Include evaluation metrics if you have them.**]

- [Metrics and results on a validation set or benchmark.]

## Citation

[**TODO: Add citation information if applicable.**]

## Model Card Contact

[**TODO: Add your contact information.**]

- [Your name/organization]
- [Your email/website/Hugging Face profile]