|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- galsenai/centralized_wolof_french_translation_data |
|
language: |
|
- wo |
|
- fr |
|
base_model: |
|
- facebook/nllb-200-distilled-600M |
|
pipeline_tag: translation |
|
--- |
|
|
|
# Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model
|
|
|
## Model Details |
|
|
|
### Model Description |
|
A fine-tuned version of Meta's NLLB-200 (distilled, 600M parameters) specialized for French-to-Wolof translation. The model was trained to improve the accessibility of content between the French and Wolof languages.
|
|
|
- **Developed by:** Lahad |
|
- **Model type:** Sequence-to-Sequence Translation Model |
|
- **Language(s):** French (fra_Latn) ↔️ Wolof (wol_Latn)
|
- **License:** CC-BY-NC-4.0 |
|
- **Finetuned from model:** facebook/nllb-200-distilled-600M |
|
|
|
### Model Sources |
|
- **Repository:** [Hugging Face - Lahad/nllb200-francais-wolof](https://huggingface.co/Lahad/nllb200-francais-wolof) |
|
- **GitHub:** [Fine-tuning NLLB-200 for French-Wolof](https://github.com/LahadMbacke/Fine-tuning_facebook-nllb-200-distilled-600M_French_to_Wolof) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- Text translation between French and Wolof |
|
- Content localization |
|
- Language learning assistance |
|
- Cross-cultural communication |
|
|
|
### Out-of-Scope Use |
|
- Commercial use without proper licensing |
|
- Translation of highly technical or specialized content |
|
- Legal or medical document translation where professional human translation is required |
|
- Real-time speech translation |
|
|
|
## Bias, Risks, and Limitations |
|
1. Language Variety Limitations: |
|
- Limited coverage of regional Wolof dialects |
|
- May not handle cultural nuances effectively |
|
|
|
2. Technical Limitations: |
|
- Maximum context window of 128 tokens |
|
- Reduced performance on technical/specialized content |
|
- May struggle with informal language and slang |
|
|
|
3. Potential Biases: |
|
- Training data may reflect cultural biases |
|
- May perform better on standard/formal language |
|
|
|
## Recommendations |
|
- Use for general communication and content translation |
|
- Verify translations for critical communications |
|
- Consider regional language variations |
|
- Implement human review for sensitive content |
|
- Test translations in intended context before deployment |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer; src_lang tells the NLLB tokenizer that inputs are French
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof", src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translate a French sentence into Wolof
def translate(text, max_length=128):
    # Tokenize the source text, truncating to the model's 128-token context window
    inputs = tokenizer(
        text,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )

    # Force the decoder to start with the Wolof language token
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length,
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
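
The helper above can then be called directly; the sample sentence and the sentence-splitting workaround below are illustrative only, since the model truncates anything beyond 128 tokens:

```python
# Single-sentence example
print(translate("Bonjour, comment allez-vous ?"))

# For passages longer than the 128-token window, a naive workaround is to
# translate sentence by sentence; a proper sentence segmenter would be more robust.
import re

def translate_long(text, max_length=128):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate(s, max_length=max_length) for s in sentences if s)
```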
|
|
|
## Training Details |
|
|
|
### Training Data |
|
- **Dataset:** galsenai/centralized_wolof_french_translation_data |
|
- **Split:** 80% training, 20% testing |
|
- **Format:** JSON pairs of French and Wolof translations |
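
A minimal sketch of how such a split can be reproduced with the `datasets` library (assuming the raw data ships as a single `train` split; the seed is arbitrary):

```python
from datasets import load_dataset

# Load the Wolof-French parallel corpus and carve out an 80/20 train/test split
dataset = load_dataset("galsenai/centralized_wolof_french_translation_data")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data, test_data = split["train"], split["test"]
```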
|
|
|
### Training Procedure |
|
#### Preprocessing |
|
- Dynamic tokenization with padding |
|
- Maximum sequence length: 128 tokens |
|
- Source/target language tags: fra_Latn/wol_Latn
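
A sketch of this preprocessing, assuming the parallel text sits in `fr` and `wo` fields (field names are illustrative, and `train_data`/`test_data` come from the split sketched above):

```python
from transformers import AutoTokenizer

# FLORES-200 language tags tell the NLLB tokenizer which languages it is handling
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="fra_Latn", tgt_lang="wol_Latn"
)

def preprocess(batch):
    # Tokenize French sources and Wolof targets, truncating to 128 tokens;
    # padding is applied dynamically later by the data collator
    return tokenizer(
        batch["fr"], text_target=batch["wo"], max_length=128, truncation=True
    )

tokenized_train = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)
tokenized_test = test_data.map(preprocess, batched=True, remove_columns=test_data.column_names)
```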
|
|
|
#### Training Hyperparameters |
|
- Learning rate: 2e-5 |
|
- Batch size: 8 per device |
|
- Training epochs: 3 |
|
- FP16 training: Enabled |
|
- Evaluation strategy: Per epoch |
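
These settings map onto `Seq2SeqTrainingArguments` roughly as follows; the output directory is a placeholder and the trainer wiring continues the preprocessing sketch above, so treat this as an approximation rather than the exact training script:

```python
from transformers import (AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    fp16=True,                  # mixed-precision training
    eval_strategy="epoch",      # `evaluation_strategy` on transformers < 4.41
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # dynamic padding
)
trainer.train()
```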
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
- **Testing Data:** 20% of dataset |
|
- **Metrics:** [Not Specified]

- **Evaluation Factors:**
|
- Translation accuracy |
|
- Semantic preservation |
|
- Grammar correctness |
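
No automatic scores are reported; one way to obtain them on the held-out pairs is with the `evaluate` library, as sketched below (the metric choice, sample size, and `fr`/`wo` field names are assumptions, and `translate`/`test_data` come from the snippets above):

```python
import evaluate

# Corpus-level BLEU and chrF on a small sample of held-out pairs
bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

sample = test_data.select(range(100))
predictions = [translate(pair["fr"]) for pair in sample]
references = [[pair["wo"]] for pair in sample]

print("BLEU:", bleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])
```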
|
|
|
## Environmental Impact |
|
- **Hardware Type:** NVIDIA T4 GPU |
|
- **Hours used:** 5 |
|
- **Cloud Provider:** [Not Specified] |
|
- **Compute Region:** [Not Specified] |
|
- **Carbon Emitted:** [Not Calculated] |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
- Architecture: NLLB-200 (Distilled 600M version) |
|
- Objective: Neural Machine Translation |
|
- Parameters: 600M |
|
- Context Window: 128 tokens |
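
A quick sanity check of the parameter count, assuming the model is already loaded as in the getting-started snippet:

```python
# Should print roughly 600M for the distilled NLLB-200 checkpoint
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")
```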
|
|
|
### Compute Infrastructure |
|
- Training Hardware: NVIDIA T4 GPU |
|
- Training Time: 5 hours |
|
- Software Framework: Hugging Face Transformers |
|
|
|
## Model Card Contact |
|
For questions about this model, please open a discussion in the Community tab of the model's Hugging Face repository.
|
|