sglang inference issue

#1
by su400 - opened

[2025-02-15 23:14:31 TP9] Scheduler hit an exception: Traceback (most recent call last):
File "/home/kkk/ai/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/home/kkk/ai/sglang/python/sglang/srt/managers/scheduler.py", line 240, in init
self.tp_worker = TpWorkerClass(
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in init
self.model_runner = ModelRunner(
File "/home/kkk/ai/sglang/python/sglang/srt/model_executor/model_runner.py", line 194, in init
self.load_model()
File "/home/kkk/ai/sglang/python/sglang/srt/model_executor/model_runner.py", line 317, in load_model
self.model = get_model(
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/init.py", line 22, in get_model
return loader.load_model(
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/loader.py", line 137, in _initialize_model
quant_config = _get_quantization_config(model_config, load_config)
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/loader.py", line 107, in _get_quantization_config
quant_config = get_quant_config(model_config, load_config)
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/weight_utils.py", line 153, in get_quant_config
return quant_cls.from_config(hf_quant_config)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/gptq.py", line 73, in from_config
desc_act = cls.get_from_keys(config, ["desc_act"])
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/base_config.py", line 114, in get_from_keys
raise ValueError(f"Cannot find any of {keys} in the model's "
ValueError: Cannot find any of ['desc_act'] in the model's quantization config.

Open Platform for Enterprise AI org

Sorry, we manually converted it from the auto-round format. Please sync with the latest model; the only changes are some configuration files.
https://huggingface.co/OPEA/DeepSeek-R1-int4-gptq-sym-inc/commit/cf5a1db4237e16f3c756a8fe039d64bd81c2d7d8
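
If you already have a local copy, something like the following rough sketch (using huggingface_hub; the local path is a placeholder) should re-download just the updated JSON files and confirm that desc_act is now present in quantization_config:

import json
from huggingface_hub import snapshot_download

local_dir = "/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc"  # placeholder: your local model directory

# Re-download only the small JSON files changed in the commit linked above.
snapshot_download(
    repo_id="OPEA/DeepSeek-R1-int4-gptq-sym-inc",
    allow_patterns=["*.json"],
    local_dir=local_dir,
)

# The GPTQ loader looks for desc_act inside quantization_config, which was the
# key reported missing in the ValueError above.
with open(f"{local_dir}/config.json") as f:
    quant_cfg = json.load(f).get("quantization_config", {})
print("desc_act present:", "desc_act" in quant_cfg)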

Open Platform for Enterprise AI org

Additionally, please be mindful of the overflow issue mentioned in the README. If the kernel of the framework you are using does not support the bf16 compute dtype, it may produce unexpected results.
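
As a toy illustration of why the compute dtype matters (generic PyTorch, not specific to this checkpoint), squaring moderately large activation values already overflows float16's representable range, while bfloat16 stays finite:

import torch

# float16's maximum is about 65504, so 300**2 = 90000 already overflows;
# bfloat16 trades precision for a much wider dynamic range and stays finite.
x = torch.full((4096,), 300.0)
print((x.to(torch.float16) ** 2).sum())   # inf
print((x.to(torch.bfloat16) ** 2).sum())  # finite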

(SGLang) kkk@kk02:~/ai/sglang$ python3 -m sglang.launch_server --model-path /home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc --tp 16 --dist-init-addr 192.168.0.177:5000 --trust-remote-code --host 0.0.0.0 --port 9997 --context-length 32768 --dtype float16 --mem-fraction 0.8 --served-model-name DeepSeek-R1 --disable-mla --nnodes 2 --node-rank 0 --disable-cuda-graph
INFO 02-16 00:43:26 __init__.py:190] Automatically detected platform cuda.
[2025-02-16 00:43:32] server_args=ServerArgs(model_path='/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc', tokenizer_path='/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=32768, device='cuda', served_model_name='DeepSeek-R1', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=9997, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=16, stream_interval=1, stream_output=False, random_seed=880885632, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr='192.168.0.177:5000', nnodes=2, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=True, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
INFO 02-16 00:43:32 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:36 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:43:41 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:41 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:41 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP5] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP0] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP2] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP4] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP7] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP1] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP3] Init torch distributed begin.
INFO 02-16 00:43:42 gptq_marlin.py:111] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-16 00:43:42 TP6] Init torch distributed begin.
[2025-02-16 00:43:43 TP0] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP4] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP1] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP3] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP6] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP7] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP5] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP2] sglang is using nccl==2.25.1
[2025-02-16 00:43:43 TP3] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP2] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP1] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP4] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP5] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP0] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP6] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP7] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:43:43 TP0] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP5] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP1] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP3] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP4] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP2] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP6] Load weight begin. avail mem=46.70 GB
[2025-02-16 00:43:43 TP7] Load weight begin. avail mem=46.70 GB
INFO 02-16 00:43:43 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:43 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:43 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:44 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:44 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:44 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:44 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-16 00:43:44 gptq_marlin.py:202] Using MarlinLinearKernel for GPTQMarlinLinearMethod
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/71 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/71 [00:00<00:38, 1.81it/s]
Loading safetensors checkpoint shards: 3% Completed | 2/71 [00:01<00:38, 1.81it/s]
Loading safetensors checkpoint shards: 4% Completed | 3/71 [00:01<00:38, 1.78it/s]
Loading safetensors checkpoint shards: 6% Completed | 4/71 [00:02<00:38, 1.73it/s]
Loading safetensors checkpoint shards: 7% Completed | 5/71 [00:02<00:38, 1.71it/s]
Loading safetensors checkpoint shards: 8% Completed | 6/71 [00:03<00:37, 1.74it/s]
Loading safetensors checkpoint shards: 10% Completed | 7/71 [00:03<00:36, 1.78it/s]
Loading safetensors checkpoint shards: 11% Completed | 8/71 [00:04<00:35, 1.78it/s]
Loading safetensors checkpoint shards: 13% Completed | 9/71 [00:05<00:33, 1.84it/s]
Loading safetensors checkpoint shards: 14% Completed | 10/71 [00:05<00:31, 1.93it/s]
Loading safetensors checkpoint shards: 15% Completed | 11/71 [00:05<00:30, 1.99it/s]
Loading safetensors checkpoint shards: 17% Completed | 12/71 [00:06<00:29, 2.02it/s]
Loading safetensors checkpoint shards: 18% Completed | 13/71 [00:06<00:28, 2.06it/s]
Loading safetensors checkpoint shards: 20% Completed | 14/71 [00:07<00:27, 2.11it/s]
Loading safetensors checkpoint shards: 21% Completed | 15/71 [00:07<00:26, 2.15it/s]
Loading safetensors checkpoint shards: 23% Completed | 16/71 [00:08<00:25, 2.13it/s]
Loading safetensors checkpoint shards: 24% Completed | 17/71 [00:08<00:25, 2.12it/s]
Loading safetensors checkpoint shards: 25% Completed | 18/71 [00:09<00:25, 2.11it/s]
Loading safetensors checkpoint shards: 27% Completed | 19/71 [00:09<00:24, 2.13it/s]
Loading safetensors checkpoint shards: 28% Completed | 20/71 [00:10<00:24, 2.12it/s]
Loading safetensors checkpoint shards: 30% Completed | 21/71 [00:10<00:23, 2.12it/s]
Loading safetensors checkpoint shards: 31% Completed | 22/71 [00:11<00:22, 2.16it/s]
Loading safetensors checkpoint shards: 32% Completed | 23/71 [00:11<00:22, 2.17it/s]
Loading safetensors checkpoint shards: 34% Completed | 24/71 [00:12<00:21, 2.16it/s]
Loading safetensors checkpoint shards: 35% Completed | 25/71 [00:12<00:21, 2.19it/s]
Loading safetensors checkpoint shards: 37% Completed | 26/71 [00:12<00:20, 2.16it/s]
Loading safetensors checkpoint shards: 38% Completed | 27/71 [00:13<00:20, 2.18it/s]
Loading safetensors checkpoint shards: 39% Completed | 28/71 [00:13<00:20, 2.13it/s]
Loading safetensors checkpoint shards: 41% Completed | 29/71 [00:14<00:20, 2.10it/s]
Loading safetensors checkpoint shards: 42% Completed | 30/71 [00:14<00:20, 2.04it/s]
Loading safetensors checkpoint shards: 44% Completed | 31/71 [00:15<00:19, 2.06it/s]
Loading safetensors checkpoint shards: 45% Completed | 32/71 [00:15<00:18, 2.10it/s]
Loading safetensors checkpoint shards: 46% Completed | 33/71 [00:16<00:17, 2.13it/s]
Loading safetensors checkpoint shards: 48% Completed | 34/71 [00:16<00:17, 2.12it/s]
Loading safetensors checkpoint shards: 49% Completed | 35/71 [00:17<00:17, 2.08it/s]
Loading safetensors checkpoint shards: 51% Completed | 36/71 [00:17<00:16, 2.09it/s]
Loading safetensors checkpoint shards: 52% Completed | 37/71 [00:18<00:16, 2.06it/s]
Loading safetensors checkpoint shards: 54% Completed | 38/71 [00:18<00:15, 2.08it/s]
Loading safetensors checkpoint shards: 55% Completed | 39/71 [00:19<00:15, 2.12it/s]
Loading safetensors checkpoint shards: 56% Completed | 40/71 [00:19<00:14, 2.14it/s]
Loading safetensors checkpoint shards: 58% Completed | 41/71 [00:20<00:13, 2.14it/s]
Loading safetensors checkpoint shards: 59% Completed | 42/71 [00:20<00:13, 2.08it/s]
Loading safetensors checkpoint shards: 61% Completed | 43/71 [00:20<00:12, 2.23it/s]
Loading safetensors checkpoint shards: 62% Completed | 44/71 [00:21<00:12, 2.17it/s]
Loading safetensors checkpoint shards: 63% Completed | 45/71 [00:21<00:12, 2.14it/s]
Loading safetensors checkpoint shards: 65% Completed | 46/71 [00:22<00:11, 2.11it/s]
Loading safetensors checkpoint shards: 66% Completed | 47/71 [00:22<00:11, 2.08it/s]
Loading safetensors checkpoint shards: 68% Completed | 48/71 [00:23<00:11, 2.09it/s]
Loading safetensors checkpoint shards: 69% Completed | 49/71 [00:23<00:10, 2.02it/s]
Loading safetensors checkpoint shards: 70% Completed | 50/71 [00:24<00:10, 2.06it/s]
Loading safetensors checkpoint shards: 72% Completed | 51/71 [00:24<00:09, 2.00it/s]
Loading safetensors checkpoint shards: 73% Completed | 52/71 [00:25<00:09, 1.94it/s]
Loading safetensors checkpoint shards: 75% Completed | 53/71 [00:25<00:09, 1.94it/s]
Loading safetensors checkpoint shards: 76% Completed | 54/71 [00:26<00:08, 1.96it/s]
Loading safetensors checkpoint shards: 77% Completed | 55/71 [00:26<00:08, 1.99it/s]
Loading safetensors checkpoint shards: 79% Completed | 56/71 [00:27<00:07, 2.00it/s]
Loading safetensors checkpoint shards: 80% Completed | 57/71 [00:27<00:06, 2.31it/s]
Loading safetensors checkpoint shards: 82% Completed | 58/71 [00:28<00:05, 2.20it/s]
Loading safetensors checkpoint shards: 83% Completed | 59/71 [00:28<00:05, 2.15it/s]
Loading safetensors checkpoint shards: 85% Completed | 60/71 [00:29<00:05, 2.10it/s]
Loading safetensors checkpoint shards: 86% Completed | 61/71 [00:29<00:04, 2.11it/s]
Loading safetensors checkpoint shards: 87% Completed | 62/71 [00:30<00:04, 2.09it/s]
Loading safetensors checkpoint shards: 89% Completed | 63/71 [00:30<00:03, 2.08it/s]
Loading safetensors checkpoint shards: 90% Completed | 64/71 [00:31<00:03, 2.01it/s]
Loading safetensors checkpoint shards: 92% Completed | 65/71 [00:31<00:02, 2.03it/s]
Loading safetensors checkpoint shards: 93% Completed | 66/71 [00:32<00:02, 2.06it/s]
Loading safetensors checkpoint shards: 94% Completed | 67/71 [00:32<00:01, 2.05it/s]
Loading safetensors checkpoint shards: 96% Completed | 68/71 [00:33<00:01, 2.06it/s]
Loading safetensors checkpoint shards: 97% Completed | 69/71 [00:33<00:00, 2.07it/s]
Loading safetensors checkpoint shards: 99% Completed | 70/71 [00:34<00:00, 2.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 71/71 [00:34<00:00, 2.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 71/71 [00:34<00:00, 2.05it/s]

[2025-02-16 00:44:20 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:20 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:21 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:21 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:21 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.46 GB
[2025-02-16 00:44:22 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:22 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:22 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=25.35 GB
[2025-02-16 00:44:22 TP6] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP6] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:22 TP1] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP1] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:22 TP3] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP3] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:22 TP0] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP0] Memory pool end. avail mem=10.19 GB
[2025-02-16 00:44:22 TP2] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP4] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP5] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP4] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:22 TP5] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:22 TP2] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:22 TP7] KV Cache is allocated. K size: 7.49 GB, V size: 7.49 GB.
[2025-02-16 00:44:22 TP7] Memory pool end. avail mem=10.08 GB
[2025-02-16 00:44:23 TP0] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP2] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP1] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP6] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP4] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP5] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP7] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23 TP3] max_total_num_tokens=32188, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=32768
[2025-02-16 00:44:23] INFO: Started server process [209100]
[2025-02-16 00:44:23] INFO: Waiting for application startup.
[2025-02-16 00:44:23] INFO: Application startup complete.
[2025-02-16 00:44:23] INFO: Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
[2025-02-16 00:44:24] INFO: 127.0.0.1:56694 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-16 00:44:24 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-16 00:44:27 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/utils/contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func

logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/home/kkk/ai/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
return self.forward_extend(forward_batch)
File "/home/kkk/ai/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
return self.model.forward(
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 868, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 829, in forward
hidden_states, residual = layer(
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
hidden_states = self.mlp(hidden_states)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kkk/miniconda3/envs/SGLang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kkk/ai/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
TypeError: GPTQMarlinMoEMethod.apply() got an unexpected keyword argument 'correction_bias'

[2025-02-16 00:44:27] Received sigquit from a child proces. It usually means the child failed.
Killed

(SGLang) kkk@kkk:~/ai/sglang$ python3 -m sglang.launch_server --model-path /home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc --tp 16 --dist-init-addr 192.168.0.177:5000 --trust-remote-code --host 0.0.0.0 --port 9997 --context-length 32768 --dtype float16 --mem-fraction 0.8 --served-model-name DeepSeek-R1 --disable-mla --nnodes 2 --node-rank 1 --disable-cuda-graph --quantization gptq
INFO 02-16 00:41:05 __init__.py:190] Automatically detected platform cuda.
[2025-02-16 00:41:08] server_args=ServerArgs(model_path='/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc', tokenizer_path='/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, quantization='gptq', context_length=32768, device='cuda', served_model_name='DeepSeek-R1', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=9997, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=16, stream_interval=1, stream_output=False, random_seed=774239283, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr='192.168.0.177:5000', nnodes=2, node_rank=1, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=True, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP9] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP15] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP9] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP9] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP8] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP14] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP15] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP15] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP11] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP13] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP12] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP10] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP8] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP8] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP14] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP14] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP11] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP11] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP13] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP13] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP12] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP12] Init torch distributed begin.
INFO 02-16 00:41:14 gptq_marlin.py:115] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
[2025-02-16 00:41:14 TP10] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-02-16 00:41:14 TP10] Init torch distributed begin.
[2025-02-16 00:41:15 TP8] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP12] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP9] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP10] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP11] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP14] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP13] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP15] sglang is using nccl==2.25.1
[2025-02-16 00:41:15 TP15] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP14] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP12] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP11] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP13] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP10] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP9] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:15 TP8] Custom allreduce is disabled because this process group spans across nodes.
[2025-02-16 00:41:16 TP15] Load weight begin. avail mem=46.99 GB
[2025-02-16 00:41:16 TP12] Load weight begin. avail mem=46.99 GB
[2025-02-16 00:41:16 TP9] Load weight begin. avail mem=46.74 GB
[2025-02-16 00:41:16 TP13] Load weight begin. avail mem=46.99 GB
[2025-02-16 00:41:16 TP11] Load weight begin. avail mem=46.74 GB
[2025-02-16 00:41:16 TP14] Load weight begin. avail mem=45.49 GB
[2025-02-16 00:41:16 TP10] Load weight begin. avail mem=46.99 GB
[2025-02-16 00:41:16 TP8] Load weight begin. avail mem=46.74 GB
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
[2025-02-16 00:41:16 TP8] Scheduler hit an exception: Traceback (most recent call last):
File "/home/kkk/ai/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/home/kkk/ai/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
self.tp_worker = TpWorkerClass(
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/home/kkk/ai/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/home/kkk/ai/sglang/python/sglang/srt/model_executor/model_runner.py", line 194, in __init__
self.load_model()
File "/home/kkk/ai/sglang/python/sglang/srt/model_executor/model_runner.py", line 317, in load_model
self.model = get_model(
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/home/kkk/ai/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 847, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 808, in __init__
[
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 809, in <listcomp>
DeepseekV2DecoderLayer(
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 739, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/home/kkk/ai/sglang/python/sglang/srt/models/deepseek_v2.py", line 146, in __init__
self.experts = MoEImpl(
File "/home/kkk/ai/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 295, in __init__
assert self.quant_method is not None
AssertionError

[2025-02-16 00:41:16] Received sigquit from a child proces. It usually means the child failed.
(The identical Scheduler traceback, ending in the same AssertionError at fused_moe_triton/layer.py line 295, was also printed by TP9, TP10, TP11, TP12, TP13, TP14, and TP15, interleaved with further "Received sigquit from a child proces. It usually means the child failed." messages.)

Open Platform for Enterprise AI org

I guess this may be related to a framework issue, but I am not quite sure.
https://github.com/vllm-project/vllm/issues/7494

Also, as suggested in the README, it is better not to use the Marlin kernel.
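
To confirm the TypeError above is a version mismatch rather than a model issue, you could check whether the MoE method in your installed vLLM accepts the correction_bias keyword that SGLang's fused MoE layer passes. A rough diagnostic sketch (the import path is assumed to match your vLLM version):

import inspect
from vllm.model_executor.layers.quantization.gptq_marlin import GPTQMarlinMoEMethod

# SGLang's fused_moe_triton layer calls self.quant_method.apply(..., correction_bias=...),
# which fails if the installed GPTQMarlinMoEMethod.apply() does not accept that keyword.
sig = inspect.signature(GPTQMarlinMoEMethod.apply)
print(sig)
print("accepts correction_bias:", "correction_bias" in sig.parameters)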

cicdatopea changed discussion title from ValueError: Cannot find any of ['desc_act'] in the model's quantization config. to sglang inference issue

Thank you very much. I used cognitive-computations/DeepSeek-R1-AWQ and was able to run inference successfully in the same environment. However, the AWQ model has a problem: it replies with long, unrelated statements, though not garbled ones. I hope your GPTQ model works out. MoE models may be especially hard to quantize and there may be no substitute, but my machine can only run 4-bit quantization.

Open Platform for Enterprise AI org

Could you follow the model and test the --quantization moe_wna16 option in vLLM? Since we don't have an 8×80GB GPU machine, we are unable to test it ourselves.
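
For reference, a rough sketch of what such a test could look like with vLLM's offline LLM API (the tensor-parallel size, dtype, and context length are placeholders and need to match your hardware; multi-node serving may still be required for this model size):

from vllm import LLM, SamplingParams

llm = LLM(
    model="/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc",  # local path from the command above
    quantization="moe_wna16",
    tensor_parallel_size=8,      # adjust to your GPU count
    dtype="bfloat16",            # per the overflow note above; use float16 only if bf16 is unsupported
    max_model_len=32768,
    trust_remote_code=True,
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)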

You could also test the model in Transformers or on the CPU to check if it meets your needs first. Please note that when using Transformers with CUDA, the model's output may not be as good as on the CPU due to overflow issues.
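
A minimal sketch of such a functional check with Transformers on CPU (assuming your Transformers/GPTQ stack supports CPU execution for this checkpoint and the machine has enough host memory; generation will be slow):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/kkk/ai/models/DeepSeek-R1-int4-gptq-sym-inc"  # or the Hub id OPEA/DeepSeek-R1-int4-gptq-sym-inc

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bf16 sidesteps the fp16 overflow issue mentioned above
    device_map="cpu",
    trust_remote_code=True,
)

inputs = tokenizer("9.11 and 9.8, which one is greater?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))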
