Tokenizer config's `chat_template` removes everything before `</think>` XML closing tag

#21
by jamesbraza - opened

The tokenizer's chat_template is set up to remove everything up to and including the last </think> XML closing tag in a message's content. This part of the template does it:

{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}

This was added in https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/commit/22b62e0c62bf91c28f2628f8f974338fe0d1d66b#d2h-846292

Can we remove this aspect of the template, or document its purpose?

The issue is, if one has reasoning traces in a dataset wrapped in <think></think> XML tags, applying this chat_template will unexpectedly clobber the reasoning, leaving only the text after </think>.
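
To make the effect concrete, here is a minimal sketch in plain Python that mirrors the template clause quoted above. The function name strip_reasoning and the sample strings are made up for illustration; it only reproduces the split expression, not the full chat_template.

```python
def strip_reasoning(content: str) -> str:
    """Hypothetical helper mirroring the template clause:
    keep only the text after the last '</think>' tag."""
    if "</think>" in content:
        content = content.split("</think>")[-1]
    return content


# Made-up example of a message whose reasoning is wrapped in <think> tags.
message = "<think>2 + 2 is 4.</think>The answer is 4."

print(strip_reasoning(message))          # The answer is 4.
print(strip_reasoning("No tags here."))  # No tags here.
```

Note that because the clause takes the last element of the split, content with several </think> tags keeps only what follows the final one.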
