---
library_name: transformers
tags: []
---

# nllb-200-600M-En-Ar

This model is a fine-tuned version of the NLLB-200-600M model, specifically adapted for translating from English to Egyptian Arabic. Fine-tuned on a custom dataset of 12,000 samples, it aims to provide high-quality translations that capture the nuances and colloquial expressions of Egyptian Arabic.

The dataset used for fine-tuning was collected from high-quality transcriptions of videos, ensuring the language data is rich and contextually accurate.

### Model Details

- **Base Model**: [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Language Pair**: English to Egyptian Arabic
- **Dataset**: 12,000 custom translation pairs


### Usage

To use this model for translation, you can load it with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Mhassanen/nllb-200-600M-En-Ar"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="arz_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    # Force decoding to start with the Egyptian Arabic language token,
    # so the model generates in the intended target language
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"),
    )
    # batch_decode returns a list of strings; take the single translation
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

text = "Hello, how are you?"
print(translate(text))
```
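NLLB-style models work best on sentence-sized inputs, so longer passages are usually split into sentences and translated one at a time. A minimal sketch (the regex-based splitter is a naive assumption for illustration; a dedicated sentence segmenter would be more robust):

```python
import re

def split_sentences(text):
    # Naive splitter: break on ., !, or ? followed by whitespace.
    # Approximate; fails on abbreviations like "Dr." or "e.g."
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def translate_passage(text, translate_fn):
    # Translate a multi-sentence passage one sentence at a time
    # and rejoin the results with spaces
    return " ".join(translate_fn(s) for s in split_sentences(text))

# Example with a stand-in function; substitute the translate() defined above
print(translate_passage("Hello. How are you?", lambda s: s.upper()))
```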

### Performance

The model was evaluated on a held-out validation set to check translation quality. It performs well on colloquial Egyptian Arabic, though additional data and further fine-tuning could improve its coverage.
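The card does not state which metric was used for validation. As an illustration only (an assumption, not this model's actual evaluation script), corpus-level BLEU with a brevity penalty can be computed in pure Python like this:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    # Count n-gram occurrences in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus BLEU with uniform n-gram weights and brevity penalty (0-100)."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            c_ngrams, r_ngrams = ngrams(c, n), ngrams(r, n)
            # Clipped n-gram matches against the reference
            matches[n - 1] += sum(min(cnt, r_ngrams[g]) for g, cnt in c_ngrams.items())
            totals[n - 1] += max(len(c) - n + 1, 0)
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n) * 100

print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # 100.0
```

In practice a standard implementation such as `sacrebleu` is preferable, since tokenization details change the score.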

### Limitations

- **Dataset Size**: The custom dataset consists of 12,000 samples, which may limit coverage of diverse expressions and rare terms.
- **Colloquial Variations**: Egyptian Arabic has many dialectal variations, which might not all be covered equally.

### Acknowledgements

This model builds upon [NLLB-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M), developed by Meta AI, fine-tuned to cater specifically to the Egyptian Arabic dialect.

Feel free to contribute or provide feedback to help improve this model!