---
library_name: transformers
tags: []
---
# nllb-200-600M-En-Ar
This model is a fine-tuned version of the NLLB-200-600M model, specifically adapted for translating from English to Egyptian Arabic. Fine-tuned on a custom dataset of 12,000 samples, it aims to provide high-quality translations that capture the nuances and colloquial expressions of Egyptian Arabic.
The dataset used for fine-tuning was collected from high-quality transcriptions of videos, ensuring the language data is rich and contextually accurate.
### Model Details
- **Base Model**: [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Language Pair**: English to Egyptian Arabic
- **Dataset**: 12,000 custom translation pairs
### Usage
To use this model for translation, you can load it with the `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "Mhassanen/nllb-200-600M-En-Ar"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="arz_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    # NLLB models need the target language forced as the first generated token
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"),
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

text = "Hello, how are you?"
print(translate(text))
```
### Performance
The model was evaluated on a held-out validation set. It captures colloquial Egyptian Arabic well, though additional training data and further fine-tuning could improve coverage and quality.
### Limitations
- **Dataset Size**: The custom dataset consists of 12,000 samples, which may limit coverage of diverse expressions and rare terms.
- **Colloquial Variations**: Egyptian Arabic has many dialectal variations, which might not all be covered equally.
### Acknowledgements
This model builds upon [NLLB-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M), developed by Meta AI, fine-tuned to cater specifically to the Egyptian Arabic dialect.
Feel free to contribute or provide feedback to help improve this model!