---
language: multilingual
license: mit
base_model:
- facebook/m2m100-12B-avg-5-ckpt
---
|
# M2M100 12B (average of last 5 checkpoints)

- This is a copy of the model repository facebook/m2m100-12B-avg-5-ckpt,
  "a multilingual encoder-decoder (seq-to-seq) model trained for
  Many-to-Many multilingual translation".
- The model in the original repository is a single file of 47.2 GB,
  which can be an issue for people behind proxies that do not permit
  downloading files larger than a certain size.
|

Steps:
- The model weights have been converted to `bfloat16`.
- The model file has been chunked into files no greater than 5 GB (see the sketch below).
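
For reference, a minimal sketch of how such a conversion and chunking can be done with `transformers`. This is an assumption about the process, not the exact commands used; the output directory name is hypothetical:

```python
import torch
from transformers import M2M100ForConditionalGeneration

# Load the original checkpoint, casting the weights to bfloat16.
# (Assumes enough CPU RAM to materialize the 12B-parameter model.)
model = M2M100ForConditionalGeneration.from_pretrained(
    "facebook/m2m100-12B-avg-5-ckpt",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True)

# save_pretrained() shards the weights into files of at most
# max_shard_size; "m2m100-12B-bf16" is a hypothetical output directory.
model.save_pretrained("m2m100-12B-bf16", max_shard_size="5GB")
```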
|

## Usage

Sample usage:

|
```python
from threading import Lock

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = 'Didier/m2m100-12B-avg-5-ckpt'

device = 'mps'  # if on Apple silicon; otherwise e.g. 'cuda' or 'cpu'
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(
    model_name, device_map=device, low_cpu_mem_usage=True)
lock = Lock()


def translate(text: str, src_lang: str, tgt_lang: str) -> str:
    # Acquire the lock so that setting src_lang and tokenizing happen
    # atomically: the tokenizer is stateful and not thread-safe.
    with lock:
        tokenizer.src_lang = src_lang
        input_ids = tokenizer([text], return_tensors="pt").input_ids.to(model.device)

    # Generate the translation (outside the lock, so that concurrent
    # callers can run generation in parallel).
    outputs = model.generate(
        input_ids=input_ids,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    translation = tokenizer.batch_decode(
        outputs, skip_special_tokens=True)[0]

    return translation


text = "Ist der Ruf erst ruiniert, lebt es sich ganz ungeniert."
src_lang = 'de'
tgt_lang = 'en'

translation = translate(text, src_lang, tgt_lang)
print(f"{translation=}")

# --> "Once your reputation is ruined, you can live quite freely."
```
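
The same `translate` helper works for any language pair supported by M2M100. The codes accepted for `src_lang`/`tgt_lang` can be listed from the tokenizer; a small sketch (the French example sentence is purely illustrative):

```python
# List the language codes accepted for src_lang / tgt_lang.
print(sorted(tokenizer.lang_code_to_id))

# Reuse the helper for another direction, e.g. French to German.
print(translate("La vie est belle.", src_lang='fr', tgt_lang='de'))
```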
|