EOS token is also padding token
Hello!
There's some weird behavior with the tokenizer. When encoding text using the tokenizer from HF, it does not include an eos token, e.g.:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.encode("dogs")
# output: [128000, 18964]
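For what it's worth, the eos id can be checked against the encoding directly (a minimal sketch; the 128000/128001 values are the ones implied by the outputs in this post):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
ids = tok.encode("dogs")
print(tok.bos_token_id, tok.eos_token_id)  # 128000 128001
print(tok.eos_token_id in ids)  # False: no eos token is appended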
The tokenizer seems to use <|end_of_text|> as the padding token, however:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.batch_encode_plus(["dogs", "many cats"], padding=True)
# output: [[128000, 18964, 128001], [128000, 35676, 19987]]
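Mapping the batch ids back to tokens makes the padding explicit (a small sketch using convert_ids_to_tokens):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
batch = tok.batch_encode_plus(["dogs", "many cats"], padding=True)
for ids in batch["input_ids"]:
    # the shorter sequence is filled with <|end_of_text|> rather than a dedicated <pad> token
    print(tok.convert_ids_to_tokens(ids))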
When we look at the special tokens, the <|end_of_text|> token is indeed stored as both the padding token and the eos_token. Because it is also stored as the pad token, any eos tokens seem to get stripped after encoding.
{'bos_token': '<|begin_of_text|>',
'eos_token': '<|end_of_text|>',
'pad_token': '<|end_of_text|>',
'mask_token': '<|mask|>'}
A cursory look at the other tokens showed that there doesn't seem to be a dedicated padding token in the vocabulary.
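A quick way to double-check this is to compare the special-token ids and scan the vocab for anything pad-like (a rough sketch):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
print(tok.pad_token_id == tok.eos_token_id)  # True: both resolve to the same id
print([t for t in tok.get_vocab() if "pad" in t.lower()])  # no dedicated pad token turns up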
When using the bare tokenizer (the backend model), every instance is padded to length 512 with <|end_of_text|> tokens. This comes from the config (it has a fixed padding strategy of 512 tokens), but it looked a little odd to me.
So I'm just here to confirm whether this is intended, or whether the tokenizer should have a dedicated padding token which went missing. Thanks!
Hello!
Thank you for your interest!
Indeed, you can use the eos token as padding; this is what we have been doing during fine-tuning.
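In practice that just means aliasing the two and letting the attention mask take care of the padded positions (a minimal sketch, not specific to this repo):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tokenizer.pad_token = tokenizer.eos_token  # already the case here, shown for clarity
batch = tokenizer(["dogs", "many cats"], padding=True)
# padded <|end_of_text|> positions get attention_mask 0, so the model ignores them
print(batch["attention_mask"])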
Regarding the default padding value being 512, that was a misconfiguration in the config file. I have updated it to remove that information.
It should be fixed if you redownload the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", force_download=True)
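Alternatively, if you already have the files cached, the fixed padding can be dropped on the backend directly (a sketch using the tokenizers backend API):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tokenizer.backend_tokenizer.no_padding()  # remove any padding strategy baked into the backend
print(tokenizer.backend_tokenizer.padding)  # None once nothing is configured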
Let me know if this fixes the issue for you!