EOS token is also padding token
Hello!
There's some weird behavior with the tokenizer. When encoding text using the tokenizer from HF, it does not include an eos token, e.g.:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.encode("dogs")
# output: [128000, 18964]
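For what it's worth, the eos id can be checked against the encoding directly (a minimal sketch; the 128000/128001 values are the ones implied by the outputs in this post):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
ids = tok.encode("dogs")
print(tok.bos_token_id, tok.eos_token_id)  # 128000 128001
print(tok.eos_token_id in ids)  # False: no eos token is appended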
The tokenizer seems to use <|end_of_text|> as the padding token, however:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.batch_encode_plus(["dogs", "many cats"], padding=True)
# output: [[128000, 18964, 128001], [128000, 35676, 19987]]
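Mapping the batch ids back to tokens makes the padding explicit (a small sketch using convert_ids_to_tokens):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
batch = tok.batch_encode_plus(["dogs", "many cats"], padding=True)
for ids in batch["input_ids"]:
    # the shorter sequence is filled with <|end_of_text|> rather than a dedicated <pad> token
    print(tok.convert_ids_to_tokens(ids))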
When we look at the special tokens, the <|end_of_text|> token is indeed stored as both the padding token and the eos_token. Because it is also stored as the pad token, any eos tokens seem to get stripped after encoding.
{'bos_token': '<|begin_of_text|>',
'eos_token': '<|end_of_text|>',
'pad_token': '<|end_of_text|>',
'mask_token': '<|mask|>'}
A cursory look at the other tokens showed that there doesn't seem to be a dedicated padding token in the vocabulary.
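A quick way to double-check this is to compare the special-token ids and scan the vocab for anything pad-like (a rough sketch):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
print(tok.pad_token_id == tok.eos_token_id)  # True: both resolve to the same id
print([t for t in tok.get_vocab() if "pad" in t.lower()])  # no dedicated pad token turns up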
When using the bare tokenizer (the backend model), every instance is padded to length 512 with <|end_of_text|> tokens. This comes from the config (it has a fixed padding strategy of 512 tokens), but it looked a little odd to me.
So I'm just here to confirm whether this is intended, or whether the tokenizer should have a dedicated padding token which went missing. Thanks!
Hello!
Thank you for your interest!
Indeed, you can use the eos token as padding; this is what we have been doing during fine-tuning.
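In practice that just means aliasing the two and letting the attention mask take care of the padded positions (a minimal sketch, not specific to this repo):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tokenizer.pad_token = tokenizer.eos_token  # already the case here, shown for clarity
batch = tokenizer(["dogs", "many cats"], padding=True)
# padded <|end_of_text|> positions get attention_mask 0, so the model ignores them
print(batch["attention_mask"])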
Regarding the default padding value being 512, that was a misconfiguration in the config file. I have updated it to remove that information.
It should be fixed if you redownload the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", force_download=True)
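Alternatively, if you already have the files cached, the fixed padding can be dropped on the backend directly (a sketch using the tokenizers backend API):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tokenizer.backend_tokenizer.no_padding()  # remove any padding strategy baked into the backend
print(tokenizer.backend_tokenizer.padding)  # None once nothing is configured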
Let me know if this fixes the issue for you!