update doc, add function call and reason parser back. (#14)
Commit 96b28892f2c31d467cc7299cf198c464ca679a7d
Co-authored-by: asher <[email protected]>
README.md
CHANGED
@@ -98,7 +98,9 @@ Our model defaults to using slow-thinking reasoning, and there are two ways to d
 1. Pass "enable_thinking=False" when calling apply_chat_template.
 2. Adding "/no_think" before the prompt will force the model not to perform CoT reasoning. Similarly, adding "/think" before the prompt will force the model to perform CoT reasoning.
 
-The following code snippet shows how to use the transformers library to load and apply the model.
+The following code snippet shows how to use the transformers library to load and apply the model.
+It also demonstrates how to enable and disable the reasoning mode,
+and how to parse the reasoning process along with the final output.
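As a concrete illustration of the prompt-prefix switch above (not part of this commit), here is a minimal sketch; it assumes the `tencent/Hunyuan-A13B-Instruct` checkpoint referenced later in the deployment commands and a transformers version that understands this model's chat template:

```
# Sketch: forcing fast thinking via the "/no_think" prefix described above.
# Assumes the tencent/Hunyuan-A13B-Instruct checkpoint (trust_remote_code=True).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)

messages = [
    # "/no_think" disables CoT reasoning; "/think" would force it instead.
    {"role": "user", "content": "/no_think Write a haiku about mountains."}
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
```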
@@ -135,6 +137,28 @@ print(f"thinking_content:{think_content}\n\n")
 print(f"answer_content:{answer_content}\n\n")
 ```
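The "parse the reasoning process" step mentioned above can be sketched as follows. This is not from the commit: it assumes the model wraps its reasoning in `<think>...</think>` tags before the final answer, so verify the exact delimiters against the model card before relying on it.

```
# Sketch: splitting a decoded generation into reasoning and answer parts.
# ASSUMPTION: reasoning is wrapped in <think>...</think>; check the model card.
def split_reasoning(generated_text: str):
    open_tag, close_tag = "<think>", "</think>"
    if close_tag in generated_text:
        think_part, answer_part = generated_text.split(close_tag, 1)
        return think_part.replace(open_tag, "").strip(), answer_part.strip()
    return "", generated_text.strip()

example_output = "<think>The user wants a short greeting.</think>Hello there!"
think_content, answer_content = split_reasoning(example_output)
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```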
+### Fast and slow thinking switch
+
+This model supports two modes of operation:
+
+- Slow Thinking Mode (Default): Enables detailed internal reasoning steps before producing the final answer.
+- Fast Thinking Mode: Skips the internal reasoning process for faster inference, going straight to the final answer.
+
+**Switching to Fast Thinking Mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+```
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=False  # Use fast thinking mode
+)
+```
+
 ## Deployment
 
 For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.
@@ -195,7 +219,7 @@ trtllm-serve \
 ```
 
 
-###
+### vLLM
 
 #### Docker Image
 We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.8 is required for this Docker image.**
@@ -217,25 +241,61 @@ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
 
 Model downloaded from Hugging Face:
 ```
-docker run
+docker run --rm --ipc=host \
 -v ~/.cache:/root/.cache/ \
+ --security-opt seccomp=unconfined \
+ --net=host \
+ --gpus=all \
+ -it \
+ -e VLLM_USE_V1=0 \
+ --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+ -m vllm.entrypoints.openai.api_server \
+ --host 0.0.0.0 \
+ --tensor-parallel-size 4 \
+ --port 8000 \
+ --model tencent/Hunyuan-A13B-Instruct \
+ --trust_remote_code
 ```
 
 Model downloaded from ModelScope:
 ```
-docker run
+docker run --rm --ipc=host \
 -v ~/.cache/modelscope:/root/.cache/modelscope \
+ --security-opt seccomp=unconfined \
+ --net=host \
+ --gpus=all \
+ -it \
+ -e VLLM_USE_V1=0 \
+ --entrypoint python mirror.ccs.tencentyun.com/hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm \
+ -m vllm.entrypoints.openai.api_server \
+ --host 0.0.0.0 \
+ --tensor-parallel-size 4 \
+ --port 8000 \
+ --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+ --trust_remote_code
 ```
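Once either container above is running, the served endpoint is OpenAI-compatible. A minimal client sketch (not part of this commit), assuming the server is reachable on localhost port 8000 as configured above and the `openai` Python package is installed:

```
# Sketch: querying the OpenAI-compatible server started by the docker commands above.
# Assumes localhost:8000 and `pip install openai`; vLLM accepts any placeholder API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # matches the --model value used above
    messages=[{"role": "user", "content": "Give a one-sentence summary of MoE models."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```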
 
+#### Tool Calling with vLLM
+
+To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+| Parameter              | Value                                                                                                                          |
+|------------------------|--------------------------------------------------------------------------------------------------------------------------------|
+| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser`   | `hunyuan`                                                                                                                      |
+
+These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
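For orientation only (not part of this commit), a client-side sketch of requesting a function call through the OpenAI-compatible API once the parser flags above are set; the `get_weather` schema is purely illustrative:

```
# Sketch: function calling against a vLLM deployment with the hunyuan tool parser enabled.
# The get_weather tool is an illustrative placeholder, not part of the model or this repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Shenzhen right now?"}],
    tools=tools,
)
# If the server-side parser recognized a tool call, it arrives in structured form here.
print(response.choices[0].message.tool_calls)
```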
+### Reasoning parser
+
+vLLM reasoning parser support for the Hunyuan A13B model is under development.
+
 ### SGLang
 
 #### Docker Image