skips the thinking process
I am facing an issue with the DeepSeek-R1 AWQ model deployed using vLLM. In streaming mode, the model consistently skips the thinking process and outputs only "\n\n" instead of generating a meaningful response.
Has anyone else encountered this behavior? Any suggestions on how to resolve this?
Which vLLM version are you using, what's your startup command, and which GPUs are you running on?
Thanks for your help! 😊
vLLM Version: 0.7.2
Startup Command: python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 32768 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-reasoner --model cognitivecomputations/DeepSeek-R1-AWQ --enable-reasoning --reasoning-parser deepseek_r1
GPU Configuration: 8 * A800
Regarding --enable-reasoning --reasoning-parser deepseek_r1: these two flags make the streaming output format slightly different. If you don't want to add special support for that on the client side, simply remove them and it will work.
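In case it helps anyone who wants to keep the flags: a minimal sketch of consuming the stream with the OpenAI Python client, assuming the server above (port 12345, served model name deepseek-reasoner) and that the reasoning parser puts the thinking tokens into a separate reasoning_content delta field. This is illustrative, not an official recipe.

```python
# Sketch: read a streamed chat completion when the reasoning parser is enabled.
# With --enable-reasoning --reasoning-parser deepseek_r1, the thinking tokens
# arrive in a separate `reasoning_content` delta field instead of `content`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="deepseek-reasoner",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Thinking tokens (if any). Depending on your client version you may need
    # to read this from chunk.model_dump() instead of an attribute.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    # The final answer tokens.
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```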
thanks I'll try it out
Does the A100 support --kv-cache-dtype fp8_e5m2?
Has this been solved?
Yeah.
According to the official DeepSeek documentation:
Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "<think>\n" at the beginning of every output.
After applying this, the thinking process is now triggered at a normal frequency.
However, there are still some issues, as the model's outputs often turn into gibberish.
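For anyone else applying that recommendation: below is a minimal sketch of one way to enforce the "<think>\n" prefix, by rendering the chat template client-side and calling the plain completions endpoint. The prompt handling and sampling values are illustrative assumptions; also check whether your chat template revision already appends the prefix, so you don't add it twice.

```python
# Sketch: force every response to start with "<think>\n" by pre-filling the
# assistant turn ourselves and using the /v1/completions endpoint.
from openai import OpenAI
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "cognitivecomputations/DeepSeek-R1-AWQ", trust_remote_code=True
)
client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompt += "<think>\n"  # enforce the thinking prefix recommended by DeepSeek

completion = client.completions.create(
    model="deepseek-reasoner",
    prompt=prompt,
    max_tokens=2048,
    temperature=0.6,
)
print(completion.choices[0].text)
```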
Regarding the outputs often turning into gibberish: reduce the temperature and top-p.
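Something along these lines; the values (temperature ~0.6, top-p ~0.95) are the ones DeepSeek commonly recommends, so treat them as starting points rather than definitive settings:

```python
# Sketch: pass explicit sampling parameters to curb gibberish output.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Summarize MLA in one paragraph."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```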
Closed as the main issue is solved.
@traphix Yes, but it would be slower than on an H100.
Hello, I encountered this issue: if I don't add --kv-cache-dtype fp8_e5m2, I have to reduce --max-model-len to 8192 to avoid OOM (out-of-memory) errors when deploying on 8x H20 GPUs. Theoretically, it shouldn't be like this, right?
@ShiningMaker Try the latest dev version by building from source; it contains MLA support for AWQ, which massively reduces KV-cache VRAM usage.
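To put rough numbers on that saving, here is a back-of-the-envelope comparison of per-sequence KV-cache size with and without MLA. The dimensions (61 layers, 128 heads, 192/128 QK/V head dims, a 512+64 MLA latent) are my reading of the DeepSeek-R1 config and ignore any padding vLLM may apply, so treat them as rough assumptions:

```python
# Back-of-the-envelope KV-cache sizing; all dimensions are approximate
# assumptions taken from the DeepSeek-R1 config, not authoritative numbers.
layers = 61
heads = 128
qk_head_dim, v_head_dim = 192, 128   # 128 "nope" + 64 rope dims for QK
mla_latent = 512 + 64                # compressed KV latent + decoupled rope key
context = 32768                      # tokens per sequence (--max-model-len)

def gib(elements, bytes_per_elem):
    return elements * bytes_per_elem / 1024**3

mha_per_token = heads * (qk_head_dim + v_head_dim)  # materialized K and V
mla_per_token = mla_latent                          # one shared latent per layer

for name, per_tok in [("MHA (no MLA)", mha_per_token), ("MLA", mla_per_token)]:
    for dtype, nbytes in [("bf16", 2), ("fp8_e5m2", 1)]:
        elems = layers * per_tok * context
        print(f"{name:13s} {dtype:9s} ~{gib(elems, nbytes):7.1f} GiB per sequence")
```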