BUG: Using `AutoTokenizer.from_pretrained`'s `.encode()` function fails to add BOS token
#21 opened by m18coppola
The Llama-3 tokenizer's `.encode()` function adds a BOS token, but the Llama-3.1 tokenizer's `.encode()` function does not. Is this intended behavior?
Example:
from transformers import AutoTokenizer
llama_3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_3_1_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "Hello"
print(llama_3_tokenizer.encode(text))
print(llama_3_1_tokenizer.encode(text))
Output:
[128000, 9906]
[9906]
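The missing `128000` in the second line is the Llama-3 family's BOS token id. As a workaround until the tokenizer config is fixed upstream, one could prepend the BOS id to the encoded ids manually; the `ensure_bos` helper below is a hypothetical sketch (not from this thread), using the ids shown above.

```python
def ensure_bos(token_ids, bos_token_id):
    """Prepend bos_token_id if the sequence does not already start with it."""
    if not token_ids or token_ids[0] != bos_token_id:
        return [bos_token_id] + token_ids
    return token_ids

# Ids taken from the outputs above; 128000 is the Llama-3 BOS token id.
print(ensure_bos([9906], 128000))          # Llama-3.1 output, BOS was missing
print(ensure_bos([128000, 9906], 128000))  # Llama-3 output, already has BOS
```

In real use you would fetch the id from the tokenizer itself (`tokenizer.bos_token_id`) rather than hard-coding it, so the helper stays correct if the vocabulary changes.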
m18coppola changed discussion status to closed