Can't get 48 TPS on 8x H800

#21
by Light4Bear - opened

I am using the official vLLM nightly build (commit 7f6bae561c210da06af5d40e8861b0d2ddfe339c), with the exact command:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --max-num-batched-tokens 65536 --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.97 --dtype float16 --served-model-name deepseek-reasoner --model cognitivecomputations/DeepSeek-R1-AWQ

However, the maximum batch-size-1 throughput I get is ~36 TPS, far from the stated 48.
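For anyone reproducing this number, here is a minimal sketch of how single-request decode TPS can be measured against the OpenAI-compatible endpoint started by the command above. It assumes the server is running on localhost port 12345 with served model name `deepseek-reasoner` (both taken from the launch command) and that the `openai` client package is installed; the prompt and token budget are arbitrary.

```python
# Sketch: measure batch-size-1 decode throughput (tokens/second) against
# the vLLM OpenAI-compatible server launched with the command above.
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput as completion tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s


if __name__ == "__main__":
    # Requires `pip install openai`; endpoint and model name are taken
    # from the launch command in this thread (assumption: server is up).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": "Count from 1 to 50."}],
        max_tokens=256,
    )
    elapsed = time.perf_counter() - start
    n = resp.usage.completion_tokens
    print(f"{n} tokens in {elapsed:.2f}s -> "
          f"{tokens_per_second(n, elapsed):.1f} TPS")
```

Note this includes prefill time in the denominator, so for a short prompt it slightly understates pure decode TPS; streaming and timing only the inter-token gaps would be more precise.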
Are you using the cuda moe_wna16 PR with MLA disabled?

Cognitive Computations org

I merged that PR; it boosts performance a lot.

v2ray changed discussion status to closed
