Tokenizer config's `chat_template` removes everything before `</think>` XML closing tag
#21, opened by jamesbraza
The tokenizer's `chat_template` is set up to remove everything before the `</think>` XML closing tag. This part of the template does it:
{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}
This was added in https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/commit/22b62e0c62bf91c28f2628f8f974338fe0d1d66b#d2h-846292
Can we remove this aspect of the template, or document its purpose?
The issue is that if a dataset has reasoning traces wrapped in `<think></think>` XML tags, this `chat_template` will unexpectedly clobber the reasoning.
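To make the effect concrete, here is a minimal sketch in plain Python of what I understand that template expression to do to a message's content. The function name `strip_reasoning` and the sample message are made up for illustration; the real logic lives in the Jinja template, not in Python code like this.

```python
def strip_reasoning(content: str) -> str:
    """Mimic the template expression: keep only the text after the last '</think>'."""
    # Hypothetical helper: mirrors
    # {% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}
    if "</think>" in content:
        content = content.split("</think>")[-1]
    return content


# Made-up assistant message with a reasoning trace in <think></think> tags.
message = "<think>2 + 2 is 4.</think>The answer is 4."

print(strip_reasoning(message))  # prints: The answer is 4.
```

The reasoning inside the tags is dropped and only the final answer survives, which is the clobbering described above.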