Allow prefilling assistant message
The current chat template wraps every message in `<|im_start|>` + `message.role` + `\n` and `<|im_end|>` + `\n`, and adds `<|im_start|>assistant\n<think>\n` at the end of all the messages.
This means it's not possible to send a prefilled assistant message that the model continues generating from.
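For illustration, here is a rough Python sketch of what that rendering logic does (the placeholder conversation is mine, not from the actual template):

```python
# Rough sketch of the current template's behavior, not the actual Jinja.
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nThe user asks"},  # intended prefill
]

prompt = ""
for message in messages:
    # Every message, including a trailing assistant one, gets fully closed...
    prompt += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
# ...and a fresh generation header is always appended, so the model starts
# a brand-new turn instead of continuing the prefilled assistant message.
prompt += "<|im_start|>assistant\n<think>\n"
```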
However, if we change the template to this: https://gist.github.com/tomasmcm/6fd3397eb44e3fbef4cf876451098a92 (note the `loop.last` checks and the `role != "assistant"` check at the end), the model is able to continue from a message it received.
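In Python terms, the modified logic looks roughly like this (a sketch of my reading of the gist, not the Jinja template itself):

```python
def render_prompt(messages: list[dict]) -> str:
    """Sketch of the modified template: a trailing assistant message is
    left open so the model continues it instead of starting a new turn."""
    prompt = ""
    for i, message in enumerate(messages):
        is_last = i == len(messages) - 1  # mirrors Jinja's loop.last
        prompt += "<|im_start|>" + message["role"] + "\n" + message["content"]
        # Close every turn except a trailing assistant prefill.
        if not (is_last and message["role"] == "assistant"):
            prompt += "<|im_end|>" + "\n"
    # Add the generation prompt only when there is nothing to continue.
    if messages[-1]["role"] != "assistant":
        prompt += "<|im_start|>assistant\n<think>\n"
    return prompt
```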
This approach would allow building "reasoning_effort" or "thinking_budget_tokens" solutions: count the thinking tokens as they are generated to make sure the budget is not exceeded, and if it is, halt the generation, append `\n</think>\n\n`, and send the result back to the model as a prefilled assistant message. The model then continues from the existing `<think>` block and generates the final answer.
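A minimal sketch of that flow, assuming an OpenAI-compatible server (such as LM Studio's) that is already running the modified template; the endpoint, model name, budget, and one-chunk-per-token approximation are all placeholders, not part of the proposal:

```python
from openai import OpenAI

# Placeholders: any OpenAI-compatible endpoint using the modified template.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
MODEL = "qwen/qwq-32b"
MAX_THINKING_TOKENS = 512

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

stream = client.chat.completions.create(model=MODEL, messages=messages, stream=True)
text = ""
n_thinking = 0
for chunk in stream:
    if not chunk.choices:
        continue
    text += chunk.choices[0].delta.content or ""
    if "</think>" in text:
        continue  # thinking ended naturally; keep streaming the answer
    n_thinking += 1  # approximation: one streamed chunk ~ one token
    if n_thinking >= MAX_THINKING_TOKENS:
        break  # budget exhausted: halt and prefill below

if "</think>" not in text:
    # Close the think block ourselves and send everything back as a
    # prefilled assistant message; the modified template leaves this
    # trailing assistant turn open so the model continues from it.
    prefill = text + "\n</think>\n\n"
    final = client.chat.completions.create(
        model=MODEL,
        messages=messages + [{"role": "assistant", "content": prefill}],
    )
    text = prefill + (final.choices[0].message.content or "")

print(text)
```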
I've built an example proxy showing how to leverage this prefilling technique to offer a `max_thinking_chars` parameter. It is currently working with the template I shared above and Qwen/QwQ-32B via LM Studio:
https://github.com/tomasmcm/dttm-proxy