Tokenization bugs

#4
by noam-converge - opened

I noticed a number of issues with this model's tokenization: bracketed atoms with charges (ions) collapse back to the bare atom, e.g. [Cl+2] decodes to just Cl. In addition, the bromine atom (Br) is decoded to the B (boron) token, and more.

I also encounter the same issue.

Same issue here; all of their models have these bugs.

Can the tokenizer be swapped for a functioning BPE tokenizer without retraining the model? I assume the answer is no.

Most likely not: a Transformer model is tied to the integer token IDs produced by the tokenizer it was trained with. A different tokenizer generates a completely different mapping from input text to integers, so the model would never see your data in the form it was trained on and would most likely produce output that doesn't make any sense.
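To illustrate the point, here is a toy sketch (the vocabularies are made up, not ChemBERTa's real ones): the same SMILES string maps to entirely different integer sequences under two tokenizers, so a model trained against one mapping receives meaningless input under the other.

```python
# Two hypothetical character-level vocabularies. Neither is ChemBERTa's
# actual vocab; they only show that token IDs are tokenizer-specific.
vocab_a = {'C': 0, 'N': 1, 'O': 2, '(': 3, ')': 4, '=': 5}
vocab_b = {'(': 0, ')': 1, '=': 2, 'C': 3, 'N': 4, 'O': 5}

smiles = 'CN(C)C(=O)C'
ids_a = [vocab_a[ch] for ch in smiles]
ids_b = [vocab_b[ch] for ch in smiles]

print(ids_a)  # [0, 1, 3, 0, 4, 0, 3, 5, 2, 4, 0]
print(ids_b)  # [3, 4, 0, 3, 1, 3, 0, 2, 5, 1, 3]
# A model trained on vocab_a's IDs sees garbage if fed vocab_b's IDs.
```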

If someone wants to reproduce the bug, here is a short snippet to do so:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-10M-MTR')

# SMILES containing a bracketed attachment point [*:1] and a Br atom
sample_smiles = 'CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1'
tokens = tokenizer(sample_smiles)
print(tokens)

# Round trip: decoding should reproduce the input, but it does not
decoded_smiles = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(f"Original: {sample_smiles}")
print(f"Decoded:  {decoded_smiles}")
assert sample_smiles == decoded_smiles

And here is the output:

Original: CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1
Decoded:  CN(CCCNc1nc(Nc2ccc(*1)cc2)ncc1B)C(=O)C1CCC1

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[154], line 10
      8 print(f"Original: {sample_smiles}")
      9 print(f"Decoded:  {decoded_smiles}")
---> 10 assert sample_smiles == decoded_smiles

AssertionError:
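One plausible mechanism for this kind of loss (a toy sketch, not ChemBERTa's actual implementation): a greedy longest-match tokenizer whose vocabulary lacks the multi-character token 'Br' and the bracket/charge characters maps the unrecognized characters to an unknown token, which is then silently dropped when decoding with skip_special_tokens=True.

```python
# Toy vocabulary: has 'Cl' but no 'Br', and no '[', ']', '+', 'r', or digits
# for charges. Purely illustrative, not ChemBERTa's real vocab.
VOCAB = ['Cl', 'C', 'N', 'B', 'c', '1', '(', ')', '[UNK]']

def encode(smiles):
    tokens, i = [], 0
    while i < len(smiles):
        # Greedy longest match against the vocabulary.
        for tok in sorted(VOCAB, key=len, reverse=True):
            if tok != '[UNK]' and smiles.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            tokens.append('[UNK]')  # unrecognized character
            i += 1
    return tokens

def decode(tokens):
    # Analogue of skip_special_tokens=True: unknowns vanish on decode.
    return ''.join(t for t in tokens if t != '[UNK]')

print(decode(encode('Br')))      # 'B'  -> the 'r' was unknown and dropped
print(decode(encode('[Cl+2]')))  # 'Cl' -> brackets and charge dropped
```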

ChemBERTa's tokenizers also cannot recognize chiral centers; they tokenize them all as plain 'C'. Even so, these models still seem to work reasonably well: at least in my own experiments, the performance drop attributable to these tokenizer issues is small.
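If you want to measure how much of your own dataset is affected, a simple round-trip screen works with any tokenizer that exposes encode/decode-style calls. The sketch below uses a trivial stand-in tokenizer so it is self-contained; with a loaded Hugging Face tokenizer you would pass `lambda s: tokenizer(s)['input_ids']` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)` instead.

```python
# Screen a SMILES dataset for lossy tokenization by round-tripping
# each string through encode/decode and comparing to the original.

def find_lossy_smiles(smiles_list, encode, decode):
    """Return the subset of SMILES that do not survive a round trip."""
    return [s for s in smiles_list if decode(encode(s)) != s]

# Stand-in tokenizer that drops 'r' characters (mimicking Br -> B).
toy_encode = lambda s: [ch for ch in s if ch != 'r']
toy_decode = lambda toks: ''.join(toks)

print(find_lossy_smiles(['CCO', 'CCBr', 'c1ccccc1'], toy_encode, toy_decode))
# -> ['CCBr']
```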

@rozariwang I'm seeing higher errors than expected on a battery electrolyte transfer-learning task. I thought the issue was the Br -> B conversion, but when I remove all molecules containing Br or B from my dataset, performance doesn't improve. So I agree that these tokenization issues don't seem to be causing the errors in my experiments, which I find strange.
