BUG: Using `AutoTokenizer.from_pretrained`'s `.encode()` function fails to add BOS token
#21 opened by m18coppola
The Llama-3 tokenizer's `.encode()` function adds a BOS token, but the Llama-3.1 tokenizer's `.encode()` function does not. Is this intended behavior?
Example:
from transformers import AutoTokenizer
llama_3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_3_1_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "Hello"
print(llama_3_tokenizer.encode(text))
print(llama_3_1_tokenizer.encode(text))
Output:
[128000, 9906]
[9906]
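The missing `128000` in the second line is the Llama-3 family's BOS token id. As a workaround until the tokenizer config is fixed upstream, one could prepend the BOS id to the encoded ids manually; the `ensure_bos` helper below is a hypothetical sketch (not from this thread), using the ids shown above.

```python
def ensure_bos(token_ids, bos_token_id):
    """Prepend bos_token_id if the sequence does not already start with it."""
    if not token_ids or token_ids[0] != bos_token_id:
        return [bos_token_id] + token_ids
    return token_ids

# Ids taken from the outputs above; 128000 is the Llama-3 BOS token id.
print(ensure_bos([9906], 128000))          # Llama-3.1 output, BOS was missing
print(ensure_bos([128000, 9906], 128000))  # Llama-3 output, already has BOS
```

In real use you would fetch the id from the tokenizer itself (`tokenizer.bos_token_id`) rather than hard-coding it, so the helper stays correct if the vocabulary changes.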
m18coppola changed discussion status to closed