---
library_name: transformers
license: mit
datasets:
- galsenai/centralized_wolof_french_translation_data
language:
- wo
- fr
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---
# Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model
## Model Details
### Model Description
A fine-tuned version of Meta's NLLB-200 (600M distilled) model specialized for French-to-Wolof translation. This model was trained to improve the accessibility of content between the French and Wolof languages.
- **Developed by:** Lahad
- **Model type:** Sequence-to-Sequence Translation Model
- **Language(s):** French (fra_Latn) ↔️ Wolof (wol_Latn)
- **License:** CC-BY-NC-4.0
- **Finetuned from model:** facebook/nllb-200-distilled-600M
### Model Sources
- **Repository:** [Hugging Face - Lahad/nllb200-francais-wolof](https://huggingface.co/Lahad/nllb200-francais-wolof)
- **GitHub:** [Fine-tuning NLLB-200 for French-Wolof](https://github.com/LahadMbacke/Fine-tuning_facebook-nllb-200-distilled-600M_French_to_Wolof)
## Uses
### Direct Use
- Text translation between French and Wolof
- Content localization
- Language learning assistance
- Cross-cultural communication
### Out-of-Scope Use
- Commercial use without proper licensing
- Translation of highly technical or specialized content
- Legal or medical document translation where professional human translation is required
- Real-time speech translation
## Bias, Risks, and Limitations
1. Language Variety Limitations:
   - Limited coverage of regional Wolof dialects
   - May not handle cultural nuances effectively
2. Technical Limitations:
   - Maximum context window of 128 tokens
   - Reduced performance on technical/specialized content
   - May struggle with informal language and slang
3. Potential Biases:
   - Training data may reflect cultural biases
   - May perform better on standard/formal language
## Recommendations
- Use for general communication and content translation
- Verify translations for critical communications
- Consider regional language variations
- Implement human review for sensitive content
- Test translations in intended context before deployment
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer; src_lang sets the NLLB source-language tag to French
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof", src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translate a French sentence into Wolof
def translate(text, max_length=128):
    inputs = tokenizer(
        text,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        # Force Wolof as the output language
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
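For example, a quick end-to-end check (the French sentence is only an illustration; output quality will vary):

```python
print(translate("Bonjour, comment allez-vous ?"))
```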
## Training Details
### Training Data
- **Dataset:** galsenai/centralized_wolof_french_translation_data
- **Split:** 80% training, 20% testing
- **Format:** JSON pairs of French and Wolof translations (see the loading sketch below)
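The split above can be reproduced with the `datasets` library. A minimal sketch, assuming the corpus ships as a single `train` split; the seed is an assumption, as the original split parameters are not documented here:

```python
from datasets import load_dataset

# Load the parallel corpus and create an 80/20 train/test split.
dataset = load_dataset("galsenai/centralized_wolof_french_translation_data", split="train")
dataset = dataset.train_test_split(test_size=0.2, seed=42)  # seed is an assumption
print(dataset)  # DatasetDict with "train" and "test" splits
```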
### Training Procedure
#### Preprocessing
- Dynamic tokenization with padding
- Maximum sequence length: 128 tokens
- Source/target language tags: fra_Latn/wol_Latn (see the preprocessing sketch below)
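Continuing from the loading sketch above, the preprocessing could look roughly like this; the `fr`/`wo` column names are assumptions about the dataset schema, not confirmed by this card:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

# NLLB-200 language tags: French is fra_Latn, Wolof is wol_Latn.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="fra_Latn",
    tgt_lang="wol_Latn",
)

def preprocess(batch, max_length=128):
    # Truncate both sides to the 128-token window; padding is applied
    # dynamically per batch by the collator below rather than up front.
    return tokenizer(
        batch["fr"],              # assumed source column name
        text_target=batch["wo"],  # assumed target column name
        max_length=max_length,
        truncation=True,
    )

tokenized = dataset.map(preprocess, batched=True)  # `dataset` from the sketch above
data_collator = DataCollatorForSeq2Seq(tokenizer)  # pads each batch dynamically
```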
#### Training Hyperparameters
- Learning rate: 2e-5
- Batch size: 8 per device
- Training epochs: 3
- FP16 training: Enabled
- Evaluation strategy: Per epoch (a configuration sketch follows below)
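These hyperparameters map onto a standard `Seq2SeqTrainer` setup. A hypothetical reconstruction, continuing from the preprocessing sketch above; `output_dir` and any argument not listed above are assumptions:

```python
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments

# Base checkpoint that was fine-tuned
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Values taken from the hyperparameter list above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # from the preprocessing sketch above
    eval_dataset=tokenized["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```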
## Evaluation
### Testing Data, Factors & Metrics
- **Testing Data:** 20% of the dataset (held-out test split)
- **Metrics:** [Not Specified]
- **Evaluation Factors:**
  - Translation accuracy
  - Semantic preservation
  - Grammar correctness
## Environmental Impact
- **Hardware Type:** NVIDIA T4 GPU
- **Hours used:** 5
- **Cloud Provider:** [Not Specified]
- **Compute Region:** [Not Specified]
- **Carbon Emitted:** [Not Calculated]
## Technical Specifications
### Model Architecture and Objective
- Architecture: NLLB-200 (Distilled 600M version)
- Objective: Neural Machine Translation
- Parameters: 600M (see the quick check after this list)
- Context Window: 128 tokens
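As a quick sanity check, the parameter count listed above can be verified directly from the published checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM

# Roughly 600M parameters for the distilled NLLB-200 checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```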
### Compute Infrastructure
- Training Hardware: NVIDIA T4 GPU
- Training Time: 5 hours
- Software Framework: Hugging Face Transformers
## Model Card Contact
For questions about this model, please create an issue on the model's Hugging Face repository.