Tokenization bugs

#4
by noam-converge - opened

I noticed a number of issues with this model's tokenization: bracketed atoms with charges (ions) collapse back to the bare atom, e.g. [Cl+2] decodes to just Cl. In addition, the bromine atom (Br) is decoded to the B (boron) token, and more.

I also encounter the same issue.

Same issue here; all of their models have these bugs.

Can the tokenizer be swapped for a functioning BPE tokenizer without retraining the model? I assume the answer is no.

Most likely not: a Transformer model is tied to the integer token IDs produced by the tokenizer it was trained with. A different tokenizer generates a completely different mapping from input text to integers, so the model would never see your data in the form it was trained on and would most likely produce output that doesn't make any sense.
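To illustrate the point, here is a toy sketch (the vocabularies are made up, not ChemBERTa's real ones): the same SMILES string maps to entirely different integer sequences under two tokenizers, so a model trained against one mapping receives meaningless input under the other.

```python
# Two hypothetical character-level vocabularies. Neither is ChemBERTa's
# actual vocab; they only show that token IDs are tokenizer-specific.
vocab_a = {'C': 0, 'N': 1, 'O': 2, '(': 3, ')': 4, '=': 5}
vocab_b = {'(': 0, ')': 1, '=': 2, 'C': 3, 'N': 4, 'O': 5}

smiles = 'CN(C)C(=O)C'
ids_a = [vocab_a[ch] for ch in smiles]
ids_b = [vocab_b[ch] for ch in smiles]

print(ids_a)  # [0, 1, 3, 0, 4, 0, 3, 5, 2, 4, 0]
print(ids_b)  # [3, 4, 0, 3, 1, 3, 0, 2, 5, 1, 3]
# A model trained on vocab_a's IDs sees garbage if fed vocab_b's IDs.
```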

If someone wants to reproduce the bug, here is a short snippet to do so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-10M-MTR')

# SMILES containing a bracketed attachment point [*:1] and a Br atom
sample_smiles = 'CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1'
tokens = tokenizer(sample_smiles)
print(tokens)

# Round trip: decoding should reproduce the input, but it does not
decoded_smiles = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(f"Original: {sample_smiles}")
print(f"Decoded:  {decoded_smiles}")
assert sample_smiles == decoded_smiles

And here is the output:

Original: CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1
Decoded:  CN(CCCNc1nc(Nc2ccc(*1)cc2)ncc1B)C(=O)C1CCC1

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[154], line 10
      8 print(f"Original: {sample_smiles}")
      9 print(f"Decoded:  {decoded_smiles}")
---> 10 assert sample_smiles == decoded_smiles

AssertionError:
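One plausible mechanism for this kind of loss (a toy sketch, not ChemBERTa's actual implementation): a greedy longest-match tokenizer whose vocabulary lacks the multi-character token 'Br' and the bracket/charge characters maps the unrecognized characters to an unknown token, which is then silently dropped when decoding with skip_special_tokens=True.

```python
# Toy vocabulary: has 'Cl' but no 'Br', and no '[', ']', '+', 'r', or digits
# for charges. Purely illustrative, not ChemBERTa's real vocab.
VOCAB = ['Cl', 'C', 'N', 'B', 'c', '1', '(', ')', '[UNK]']

def encode(smiles):
    tokens, i = [], 0
    while i < len(smiles):
        # Greedy longest match against the vocabulary.
        for tok in sorted(VOCAB, key=len, reverse=True):
            if tok != '[UNK]' and smiles.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            tokens.append('[UNK]')  # unrecognized character
            i += 1
    return tokens

def decode(tokens):
    # Analogue of skip_special_tokens=True: unknowns vanish on decode.
    return ''.join(t for t in tokens if t != '[UNK]')

print(decode(encode('Br')))      # 'B'  -> the 'r' was unknown and dropped
print(decode(encode('[Cl+2]')))  # 'Cl' -> brackets and charge dropped
```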

ChemBERTa's tokenizers also cannot recognize chiral centers; they tokenize them all as plain 'C'. Even so, these models still seem to work reasonably well: at least in my own experiments, the performance drop attributable to these tokenizer issues is small.
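If you want to measure how much of your own dataset is affected, a simple round-trip screen works with any tokenizer that exposes encode/decode-style calls. The sketch below uses a trivial stand-in tokenizer so it is self-contained; with a loaded Hugging Face tokenizer you would pass `lambda s: tokenizer(s)['input_ids']` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)` instead.

```python
# Screen a SMILES dataset for lossy tokenization by round-tripping
# each string through encode/decode and comparing to the original.

def find_lossy_smiles(smiles_list, encode, decode):
    """Return the subset of SMILES that do not survive a round trip."""
    return [s for s in smiles_list if decode(encode(s)) != s]

# Stand-in tokenizer that drops 'r' characters (mimicking Br -> B).
toy_encode = lambda s: [ch for ch in s if ch != 'r']
toy_decode = lambda toks: ''.join(toks)

print(find_lossy_smiles(['CCO', 'CCBr', 'c1ccccc1'], toy_encode, toy_decode))
# -> ['CCBr']
```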

@rozariwang I'm seeing higher errors than expected on a battery electrolyte transfer-learning task. I thought the issue was the Br -> B conversion, but when I remove all molecules containing Br or B from my dataset, performance doesn't improve. So I agree that these tokenization issues don't seem to be causing the errors in my experiments, which I find strange.
