---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- galsenai/centralized_wolof_french_translation_data
language:
- wo
- fr
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

# Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model

## Model Details

### Model Description
A fine-tuned version of Meta's NLLB-200 distilled 600M model, specialized for translation between French and Wolof. The model was trained to make content more accessible across the two languages.

- **Developed by:** Lahad
- **Model type:** Sequence-to-Sequence Translation Model
- **Language(s):** French (fra_Latn) ↔️ Wolof (wol_Latn)
- **License:** CC-BY-NC-4.0
- **Finetuned from model:** facebook/nllb-200-distilled-600M

### Model Sources
- **Repository:** [Hugging Face - Lahad/nllb200-francais-wolof](https://huggingface.co/Lahad/nllb200-francais-wolof)
- **GitHub:** [Fine-tuning NLLB-200 for French-Wolof](https://github.com/LahadMbacke/Fine-tuning_facebook-nllb-200-distilled-600M_French_to_Wolof)

## Uses

### Direct Use
- Text translation between French and Wolof
- Content localization
- Language learning assistance
- Cross-cultural communication

### Out-of-Scope Use
- Commercial use without proper licensing
- Translation of highly technical or specialized content
- Legal or medical document translation where professional human translation is required
- Real-time speech translation

## Bias, Risks, and Limitations
1. Language Variety Limitations:
   - Limited coverage of regional Wolof dialects
   - May not handle cultural nuances effectively
   
2. Technical Limitations:
   - Maximum context window of 128 tokens
   - Reduced performance on technical/specialized content
   - May struggle with informal language and slang

3. Potential Biases:
   - Training data may reflect cultural biases
   - May perform better on standard/formal language

## Recommendations
- Use for general communication and content translation
- Verify translations for critical communications
- Consider regional language variations
- Implement human review for sensitive content
- Test translations in intended context before deployment

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and tokenizer; French is the source language,
# so the tokenizer is told to prepend the fra_Latn language-code token.
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof", src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translate a French sentence into Wolof
def translate(text, max_length=128):
    inputs = tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        return_tensors="pt"
    )

    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        # Force the decoder to start with the Wolof language token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
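
The snippet above is hard-coded for French → Wolof. Since the card advertises both directions (🇫🇷↔️🇸🇳), a minimal direction-aware variant is sketched below, assuming the fine-tuned checkpoint still handles Wolof → French as well; `fra_Latn` and `wol_Latn` are the standard NLLB-200 language codes, and the example sentences are illustrative only.

```python
# Hedged sketch: direction-aware wrapper around the model loaded above.
def translate_pair(text, src_lang="fra_Latn", tgt_lang="wol_Latn", max_length=128):
    tokenizer.src_lang = src_lang  # tag the source side with the right language code
    inputs = tokenizer(text, max_length=max_length, truncation=True, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate_pair("Bonjour, comment allez-vous ?"))                               # fr -> wo
print(translate_pair("Na nga def ?", src_lang="wol_Latn", tgt_lang="fra_Latn"))      # wo -> fr
```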

## Training Details

### Training Data
- **Dataset:** galsenai/centralized_wolof_french_translation_data
- **Split:** 80% training, 20% testing
- **Format:** JSON pairs of French and Wolof translations
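
A minimal sketch of reproducing this split with 🤗 Datasets; the 80/20 ratio comes from this card, while the random seed and the assumption that the data ships as a single `train` split are not documented here.

```python
from datasets import load_dataset

# Load the Wolof-French pairs and reproduce an 80/20 train/test split
# (seed and split name are assumptions, not stated in this card).
raw = load_dataset("galsenai/centralized_wolof_french_translation_data", split="train")
split = raw.train_test_split(test_size=0.2, seed=42)
train_data, test_data = split["train"], split["test"]
```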

### Training Procedure
#### Preprocessing
- Dynamic tokenization with padding
- Maximum sequence length: 128 tokens
- Source/target language tags: fra_Latn / wol_Latn
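
A hedged sketch of this preprocessing step, assuming the dataset exposes `fr` and `wo` text columns (hypothetical names; check the dataset card). Dynamic padding itself is applied later by the data collator during training.

```python
MAX_LENGTH = 128  # maximum sequence length from this card

def preprocess(batch):
    # Tag the source as French and the target as Wolof so the NLLB tokenizer
    # prepends the corresponding language-code tokens.
    tokenizer.src_lang = "fra_Latn"
    tokenizer.tgt_lang = "wol_Latn"
    return tokenizer(
        batch["fr"],              # hypothetical source column name
        text_target=batch["wo"],  # hypothetical target column name
        max_length=MAX_LENGTH,
        truncation=True,
    )

tokenized_train = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)
tokenized_test = test_data.map(preprocess, batched=True, remove_columns=test_data.column_names)
```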

#### Training Hyperparameters
- Learning rate: 2e-5
- Batch size: 8 per device
- Training epochs: 3
- FP16 training: Enabled
- Evaluation strategy: Per epoch
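
These hyperparameters map onto `Seq2SeqTrainingArguments` roughly as follows; the output path, data collator, and anything else not listed above are assumptions rather than documented settings.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",  # named evaluation_strategy on older Transformers releases
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # dynamic padding
)
trainer.train()
```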

## Evaluation

### Testing Data, Factors & Metrics
- **Testing Data:** 20% held-out split of the dataset
- **Metrics:** [Not Specified]
- **Evaluation Factors:**
  - Translation accuracy
  - Semantic preservation
  - Grammar correctness
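
The card does not report numeric scores; as one way to quantify translation accuracy on the held-out split, a BLEU/chrF evaluation with `sacrebleu` is sketched below (the metric choice and column names are assumptions).

```python
import sacrebleu

# Score the held-out split with BLEU and chrF (assumed metrics; no official
# scores are reported in this card). Column names are hypothetical.
references = [example["wo"] for example in test_data]
hypotheses = [translate(example["fr"]) for example in test_data]

print("BLEU:", sacrebleu.corpus_bleu(hypotheses, [references]).score)
print("chrF:", sacrebleu.corpus_chrf(hypotheses, [references]).score)
```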

## Environmental Impact
- **Hardware Type:** NVIDIA T4 GPU
- **Hours used:** 5
- **Cloud Provider:** [Not Specified]
- **Compute Region:** [Not Specified]
- **Carbon Emitted:** [Not Calculated]

## Technical Specifications

### Model Architecture and Objective
- Architecture: NLLB-200 (Distilled 600M version)
- Objective: Neural Machine Translation
- Parameters: 600M
- Context Window: 128 tokens

### Compute Infrastructure
- Training Hardware: NVIDIA T4 GPU
- Training Time: 5 hours
- Software Framework: Hugging Face Transformers

## Model Card Contact
For questions about this model, please create an issue on the model's Hugging Face repository.