Bug in README implementation

#3
by jvelja - opened

Hi,

The implementation you suggest in the README has a subtle bug: applying the chat template with tokenize=False and then tokenizing the resulting string yields a sequence with a duplicated BOS token:

from transformers import AutoTokenizer

# Illustrative model ID; any tokenizer whose chat template prepends a BOS
# token (e.g. Llama-3-style instruct models) reproduces the issue.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

message = [
  {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone.  But I can't do that while I'm at the movie.  Can you help by impersonating me by chat with her?"},
  {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way.  I'm not willing to behave so dishonestly.  Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]
message_template = tokenizer.apply_chat_template(message, tokenize=False)
kwargs = {"padding": "longest", "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)  # tokenizing after chat-templating adds a redundant BOS token

# >> yields an input_ids tensor that when decoded is:
# >> ["<|begin_of_text|><|begin_of_text|><|start_header_id|>system...

The cleaner fix is to drop the extra tokenization step (tokens = tokenizer.encode_plus(...)) entirely and let .apply_chat_template handle tokenization itself:

tokens = tokenizer.apply_chat_template(message, tokenize=True, return_dict=True, **kwargs)
# >> yields an input_ids tensor that when decoded is:
# >> ["<|begin_of_text|><|start_header_id|>system...
Owner

Thanks for reporting this. Someone else had noticed it earlier as well and confirmed that it doesn’t affect performance. I’ll update the README to mention this, but will keep the original code since all my previous testing was done using it.
