Update default tokenization behavior to "longest" in README

#2

When you use the default code in the README, the tokenizer padding is set to `max_length`. This causes an out-of-memory (OOM) error even on an H100, because every sequence is padded to 131072 tokens, the maximum context length for Llama 3.2 3B. A much more reasonable default is padding to the length of the longest sequence in the batch, which is accomplished by setting `"padding": "longest"` in the tokenizer kwargs.
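
For reference, a minimal sketch of the change, assuming the README uses a standard `transformers` tokenizer call (the model id and example texts here are illustrative, not from the README):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; substitute the repo's actual model id.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

texts = ["a short example", "a somewhat longer example sentence"]

# Before: "max_length" pads every sequence to model_max_length
# (131072 tokens for Llama 3.2 3B), which triggers OOM even on an H100.
# batch = tokenizer(texts, padding="max_length", return_tensors="pt")

# After: "longest" pads only to the longest sequence in the batch.
batch = tokenizer(texts, padding="longest", return_tensors="pt")
```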

Thank you!

Ray2333 changed pull request status to merged
