---
library_name: transformers
language:
- en
metrics:
- bleu
pipeline_tag: translation
---
# Model Card for English-to-Darija Translation (mBART Fine-tuned Model)
## Model Details
### Model Description
This model is a fine-tuned version of the facebook/mbart-large-50-many-to-many-mmt model,
specifically tailored for translating English text to Moroccan Darija in Arabic script.
The model was trained on a custom dataset of English-Darija sentence pairs,
and it has been designed to accurately capture the nuances of the Moroccan dialect.
- **Developed by:** Aicha Lahnouki
- **Finetuned from model:** facebook/mbart-large-50-many-to-many-mmt
- **Model type:** Sequence-to-Sequence Translation (mBART architecture)
- **Language(s) (NLP):** English (`en_XX`), Moroccan Darija in Arabic script (mapped to mBART's `ar_AR` token)
## Uses
### Direct Use
This model is intended for translating English sentences into Moroccan Darija in Arabic script.
It can be used in applications such as translation services, language learning tools, or chatbots.
## Bias, Risks, and Limitations
This model was trained on 50% of the dataset provided by DODa (the Darija Open Dataset), consisting of 45,000 rows.
The testing was conducted on a sample of 100 sentences. Due to the reduced training data,
the model might not capture the full linguistic diversity of English-to-Darija translations.
Additionally, the limited test size may not fully represent the model's performance across all possible inputs,
leading to potential biases or inaccuracies when applied to unseen or diverse data.
## How to Get Started with the Model
You can start using the model for English-to-Darija translation with the following code:
```python
from transformers import pipeline
# Initialize the translation pipeline
pipe = pipeline("translation", model="alpha2002/eng_alpha_darija", tokenizer="alpha2002/eng_alpha_darija")
# Translate English to Darija
input_text = "Hello, how are you?"
translation = pipe(input_text, src_lang="en_XX", tgt_lang="ar_AR")
print("Translation:", translation[0]['translation_text'])
```
## Training Details
### Training Data
The model was trained on a custom dataset containing parallel English and Darija sentences.
The dataset was preprocessed to include language tokens specific to mBART's requirements.
### Training Procedure
#### Preprocessing
The English source text was tokenized with mBART's `en_XX` language code, and the Darija target text with the `ar_AR` code.
#### Training Hyperparameters
- **Training regime:** FP16 mixed precision was used during training to improve speed and reduce memory usage. Training was done on Google Colab using a subset of the data, with gradient accumulation to simulate larger effective batch sizes.
#### Speeds, Sizes, Times
The model was trained for 2 epochs with a batch size of 4, using the Seq2SeqTrainer from the Hugging Face Transformers library.
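The regime described above (2 epochs, batch size 4, FP16, gradient accumulation, `Seq2SeqTrainer`) corresponds roughly to a configuration like the following sketch. Values not stated above, such as `output_dir` and `gradient_accumulation_steps`, are illustrative assumptions, not details from the original run:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./eng_alpha_darija",    # illustrative path
    num_train_epochs=2,                 # as stated above
    per_device_train_batch_size=4,      # as stated above
    gradient_accumulation_steps=4,      # assumed: effective batch of 16
    fp16=True,                          # mixed-precision training
    predict_with_generate=True,         # generate translations during eval
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, tokenizer, and tokenized datasets.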
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The model was evaluated on a small held-out test set of 100 sentences.
#### Metrics
BLEU score was used to measure translation accuracy.
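For intuition about what BLEU measures, here is a simplified sentence-level sketch: uniform weights over 1- to 4-gram precisions, add-one smoothing, and a brevity penalty. The `simple_bleu` helper is our own illustration, not part of any library; real evaluations should use a tested implementation such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing
    and a brevity penalty; for illustration only."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each candidate n-gram counts at most
        # as often as it appears in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        # Add-one smoothing so one empty n-gram order
        # does not zero out the whole score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0, and the score drops as n-gram overlap shrinks or the candidate becomes much shorter than the reference.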
### Results
The model achieved a BLEU score of 11.6 on the test set,
indicating a reasonable level of accuracy given the complexity of translating between languages with different scripts and linguistic structures.
## Environmental Impact
- **Hardware Type:** Google Colab GPU (NVIDIA Tesla K80)
- **Hours used:** Approximately 2 hours for training and 1 hour for testing.
## Citation
**BibTeX:**
```bibtex
@misc{lahnouki2024eng_alpha_darija,
  author = {Aicha Lahnouki},
  title  = {English-to-Darija Translation Model},
  year   = {2024},
  url    = {https://huggingface.co/alpha2002/eng_alpha_darija},
}
```
## Model Card Authors
Lahnouki Aicha
## Model Card Contact
email: [email protected]