---
language_model:
- causal
license: apache-2.0
tags:
- multilingual
- arabic
- darija
- transformers
- text-generation
language:
- ar
- ary
model-index:
- name: Darija-LM
results: []
---
# Darija-LM
This is a multilingual language model trained on Arabic and Darija (Moroccan Arabic) Wikipedia datasets.
## Model Description
This model is a causal language model based on a GPT-like Transformer architecture. It was trained on a combination of Arabic and Darija (Moroccan Arabic) Wikipedia datasets from the 20231101 snapshot, and it uses a SentencePiece tokenizer (BPE algorithm) with a vocabulary size of 32,000.
**Key Model Details** (see the configuration sketch after this list):
- **Architecture:** GPT-like Transformer
- **Training Data:** Arabic and Darija Wikipedia (20231101 snapshot)
- **Tokenizer:** SentencePiece (BPE, vocab size: 32000)
- **Parameters:**
- Embedding Dimension (`n_embd`): 384
- Number of Heads (`n_head`): 6
- Number of Layers (`n_layer`): 6
- Block Size (`block_size`): 256
- Dropout: 0.2
- **Training Hyperparameters:** [Specify hyperparameters like learning rate, batch size, optimizer, iterations etc. **TODO: Fill in details**]
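For reference, the architecture hyperparameters above can be gathered into a small configuration object. This is a minimal illustrative sketch; the `DarijaLMConfig` class below is not part of the repository, and only the values are taken from the list above.
```python
from dataclasses import dataclass

@dataclass
class DarijaLMConfig:
    """Illustrative container for the hyperparameters listed above (not an official class)."""
    vocab_size: int = 32000   # SentencePiece BPE vocabulary
    n_embd: int = 384         # embedding dimension
    n_head: int = 6           # attention heads per layer
    n_layer: int = 6          # Transformer blocks
    block_size: int = 256     # maximum context length in tokens
    dropout: float = 0.2

config = DarijaLMConfig()
assert config.n_embd % config.n_head == 0  # each head operates on 384 // 6 = 64 dimensions
```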
## Intended Uses & Limitations
This model is intended for research purposes, specifically in the areas of multilingual NLP and low-resource language modeling, with a focus on Arabic and Darija. It can be used for text generation tasks and further fine-tuning on downstream applications.
**Limitations:**
- **Research Use Only:** This model is primarily for research and experimentation. It has not been rigorously evaluated for production environments.
- **Data Bias:** As it is trained on Wikipedia data, the model may exhibit biases present in the dataset.
- **Generation Quality:** The quality of generated text may vary. Further fine-tuning and evaluation are recommended for specific use cases.
- **Language Coverage:** While trained on Arabic and Darija, its performance on other languages is not guaranteed.
## How to Use
You can load and use this model with the `transformers` library from Hugging Face. Make sure you have `transformers`, `sentencepiece`, and `torch` installed.
```python
# pip install transformers sentencepiece torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import sentencepiece  # noqa: F401  (ensures the SentencePiece backend is available)

model_name = "Duino/Darija-LM"  # or a local path to the saved model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model and inputs to the same device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Example generation:
input_text = "مرحبا بالعالم"  # example Arabic/Darija input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate text (adjust sampling parameters as needed)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
**Note on Tokenizer:** This model uses a SentencePiece tokenizer. When loading with `transformers`, the SentencePiece model is handled automatically, provided it is correctly configured in the repository.
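If the `transformers` tokenizer classes cannot resolve the tokenizer automatically, the SentencePiece model can also be loaded directly with the `sentencepiece` package. This is a fallback sketch; the filename `tokenizer.model` is an assumption, so check the repository file list for the actual SentencePiece model file.
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the SentencePiece model file from the repo.
# The filename "tokenizer.model" is an assumption; adjust to the actual file name.
sp_path = hf_hub_download(repo_id="Duino/Darija-LM", filename="tokenizer.model")

sp = spm.SentencePieceProcessor(model_file=sp_path)
ids = sp.encode("مرحبا بالعالم", out_type=int)     # token ids
pieces = sp.encode("مرحبا بالعالم", out_type=str)  # subword pieces
print(ids, pieces)
print(sp.decode(ids))
```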
## Training Details
The model was trained using the following steps:
1. **Data Streaming and Preprocessing:** Wikipedia datasets for Arabic and Darija were streamed using the `datasets` library and preprocessed (see the sketch after this list).
2. **SentencePiece Tokenization:** A SentencePiece model was trained on a sample of the Arabic Wikipedia data.
3. **Model Training:** A GPT-like Transformer model was trained from scratch using PyTorch.
4. **Memory Optimization:** Memory mapping was used to handle large datasets efficiently.
5. **Robust Download:** Retry mechanisms were implemented for robust dataset downloading.
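The data pipeline described in steps 1, 2, and 4 could look roughly like the sketch below. This is not the actual training script: the dataset configuration names (`20231101.ar` and `20231101.ary` under `wikimedia/wikipedia`), the sample sizes, and the file paths are assumptions for illustration.
```python
import numpy as np
import sentencepiece as spm
from datasets import load_dataset

# 1. Stream the Wikipedia snapshots (config names are assumptions; adjust to the snapshot used).
ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train", streaming=True)
ary = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train", streaming=True)

# 2. Train a BPE SentencePiece model on a sample of the Arabic text.
with open("sample.txt", "w", encoding="utf-8") as f:
    for i, article in enumerate(ar):
        f.write(article["text"] + "\n")
        if i >= 10_000:  # sample size is illustrative
            break
spm.SentencePieceTrainer.train(
    input="sample.txt", model_prefix="darija_sp", vocab_size=32000, model_type="bpe"
)

# 3. Tokenize and write token ids to a memory-mapped file for efficient training.
sp = spm.SentencePieceProcessor(model_file="darija_sp.model")
ids = sp.encode(" ".join(a["text"] for _, a in zip(range(1_000), ary)), out_type=int)
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(len(ids),))
arr[:] = np.array(ids, dtype=np.uint16)
arr.flush()
```
Storing token ids as `uint16` is sufficient here because the 32,000-entry vocabulary fits within 16 bits.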
**[TODO: Add more specific details about your training process, optimizer, learning rate schedule, hardware used, training time etc.]**
## Evaluation
**[TODO: Include evaluation metrics if you have them. It's highly recommended to evaluate your model and add metrics here. For example, you could calculate perplexity on a held-out validation set.]**
- [Metrics and results on a validation set or benchmark.]
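As a starting point for the perplexity suggestion above, a held-out split could be scored roughly as follows. This is an illustrative sketch (the `perplexity` helper below is not part of the repository) and assumes the model loads via `AutoModelForCausalLM` as shown in the usage example.
```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cpu", max_length=256):
    """Rough perplexity estimate over a list of held-out texts (illustrative helper)."""
    model = model.to(device).eval()
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).input_ids.to(device)
        if ids.size(1) < 2:
            continue
        out = model(ids, labels=ids)  # transformers shifts labels internally
        n = ids.size(1) - 1           # number of predicted tokens
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / max(total_tokens, 1))

# Example usage: perplexity(model, tokenizer, validation_texts, device=device)
```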
## Citation
**[TODO: Add citation information if applicable. If you want to be cited, provide the preferred citation format.]**
## Model Card Contact
**[TODO: Add your contact information so people can reach out with questions or feedback.]**
- [Your name/organization]
- [Your email/website/Hugging Face profile]