---
library_name: transformers
language:
- en
metrics:
- bleu
pipeline_tag: translation
---

# Model Card for English-to-Darija Translation (mBART Fine-tuned)


## Model Details

### Model Description

This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt,
tailored for translating English text into Moroccan Darija written in Arabic script.
It was trained on a custom dataset of English-Darija sentence pairs
and is designed to capture the nuances of the Moroccan dialect.
This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- **Developed by:** Aicha Lahnouki
- **Finetuned from model:** facebook/mbart-large-50-many-to-many-mmt
- **Model type:** Sequence-to-Sequence Translation (mBART architecture)
- **Language(s) (NLP):** English (en_XX), Darija (ar_AR)


## Uses

### Direct Use

This model is intended for translating English sentences into Moroccan Darija in Arabic script. 
It can be used in applications such as translation services, language learning tools, or chatbots.


## Bias, Risks, and Limitations

This model was trained on 50% of the dataset provided by DODa (Darija Open Dataset), amounting to 45,000 sentence pairs,
and evaluated on a sample of 100 sentences. Because of the reduced training data,
the model may not capture the full linguistic diversity of English-to-Darija translation.
The small test set may also not fully represent the model's performance across all possible inputs,
so biases or inaccuracies are possible on unseen or diverse data.


## How to Get Started with the Model

You can start using the model for English-to-Darija translation with the following code:

```python
from transformers import pipeline

# Initialize the translation pipeline
pipe = pipeline("translation", model="alpha2002/eng_alpha_darija", tokenizer="alpha2002/eng_alpha_darija")

# Translate English to Darija
input_text = "Hello, how are you?"
translation = pipe(input_text, src_lang="en_XX", tgt_lang="ar_AR")

print("Translation:", translation[0]['translation_text'])
```
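Alternatively, for finer control over generation, the model can be loaded directly. With mBART-50, the target language is selected by forcing its language code as the first generated token. A minimal sketch (the generation parameters are illustrative, not the settings used by the author):

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("alpha2002/eng_alpha_darija")
tokenizer = MBart50TokenizerFast.from_pretrained("alpha2002/eng_alpha_darija")

tokenizer.src_lang = "en_XX"  # English source
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# mBART-50 picks the output language by forcing its code as the first token.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["ar_AR"],
    max_length=64,  # illustrative limit
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```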

## Training Details

### Training Data

The model was trained on a custom dataset containing parallel English and Darija sentences. 
The dataset was preprocessed to include language tokens specific to mBART's requirements.

### Training Procedure


#### Preprocessing

English source text was tokenized with the `en_XX` language code and Darija target text with the `ar_AR` code, matching mBART-50's expected input format.
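The exact preprocessing script is not included in this card, but a minimal sketch of the mBART-50 tokenization step might look like the following (the column names `english` and `darija` and the length limit are illustrative assumptions):

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "en_XX"  # English source
tokenizer.tgt_lang = "ar_AR"  # Darija target, written in Arabic script

def preprocess(example):
    # Tokenize the source; the en_XX language code is added automatically.
    model_inputs = tokenizer(example["english"], max_length=128, truncation=True)
    # Tokenize the target with the ar_AR code and attach it as labels.
    labels = tokenizer(text_target=example["darija"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```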

#### Training Hyperparameters

- **Training regime:** FP16 mixed precision was used during training to reduce memory use and speed up training.
  Training was done on Google Colab on a subset of the data, with gradient accumulation to simulate larger effective batch sizes.


#### Speeds, Sizes, Times

The model was trained for 2 epochs with a batch size of 4, using the Seq2SeqTrainer from the Hugging Face Transformers library.
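The full training script is not part of this card, but a minimal sketch of that setup with the 🤗 `Seq2SeqTrainer` might look like this, reusing the `tokenizer` from the preprocessing sketch above. The 2 epochs, batch size of 4, and FP16 come from the card; the gradient accumulation step count and the dataset variable are illustrative assumptions:

```python
from transformers import (
    MBartForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

args = Seq2SeqTrainingArguments(
    output_dir="eng_alpha_darija",
    num_train_epochs=2,             # stated in the card
    per_device_train_batch_size=4,  # stated in the card
    gradient_accumulation_steps=4,  # assumption: the exact value is not given
    fp16=True,                      # FP16 mixed precision, per the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # hypothetical: the tokenized DODa split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```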

## Evaluation


### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on a held-out test set of 100 sentences.


#### Metrics

BLEU score was used to measure translation accuracy.
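The card does not name the exact BLEU implementation. One common way to compute such a score is sacreBLEU via the 🤗 `evaluate` library, assuming `predictions` and `references` are lists of decoded strings:

```python
import evaluate

# Assumption: sacreBLEU via the 🤗 evaluate library; the card does not
# specify which BLEU implementation produced the reported score.
bleu = evaluate.load("sacrebleu")
result = bleu.compute(
    predictions=predictions,               # list[str]: model outputs
    references=[[r] for r in references],  # list[list[str]]: one reference each
)
print(f"BLEU: {result['score']:.1f}")
```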

### Results

The model achieved a BLEU score of 11.6 on the test set, a reasonable result
given the difficulty of translating between languages with different scripts and linguistic structures.



## Environmental Impact

- **Hardware Type:** Google Colab GPU (NVIDIA Tesla K80)
- **Hours used:** Approximately 2 hours for training and 1 hour for testing.



## Citation

**BibTeX:**

```bibtex
@misc{lahnouki2024eng_alpha_darija,
  author = {Aicha Lahnouki},
  title  = {English-to-Darija Translation Model},
  year   = {2024},
  url    = {https://huggingface.co/alpha2002/eng_alpha_darija},
}
```

## Model Card Authors

Aicha Lahnouki

## Model Card Contact

email: [email protected]