docs(sailor): add note about minimum resources of Sailor
run-sailor.sh +2 -0
@@ -12,6 +12,8 @@ printf "Running sail/Sailor-4B-Chat using vLLM OpenAI compatible API Server at p
 # ERROR 11-27 15:32:10 engine.py:366] The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (7536). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

 # After increasing gpu utilization to 0.9, the maximum number of tokens for this model is: 9456
+# Using NVIDIA 1xL4 (8 vCPU, 30 GB RAM, 24 GB VRAM) still only supports 23712 tokens.
+# Using NVIDIA 1xL40S (8 vCPU, 62 GB RAM, 48 GB VRAM) can support the full 32768 tokens. (Increasing RAM does not help; only increasing VRAM does.)

 # 7536 tokens ÷ 1.2 = 6280 words.
 # 6280 words ÷ 500 words/page = 12.56 pages. (single-spaced)
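The error message above names the two knobs vLLM exposes for this trade-off. A minimal launch sketch applying that advice: the model name comes from the script's own header, while port 8000 and the `--max-model-len 9456` value (the measured maximum at 0.9 utilization) are illustrative assumptions, not confirmed contents of `run-sailor.sh`.

```shell
# Sketch: start the vLLM OpenAI-compatible server with the KV-cache knobs set.
# --gpu-memory-utilization: fraction of VRAM vLLM may claim (raised to 0.9 here).
# --max-model-len: capped to what actually fits in the KV cache (assumed 9456).
python -m vllm.entrypoints.openai.api_server \
  --model sail/Sailor-4B-Chat \
  --gpu-memory-utilization 0.9 \
  --max-model-len 9456 \
  --port 8000
```

If `--max-model-len` exceeds what the KV cache can hold, the engine refuses to start with the error quoted above, so capping it is the cheaper fix when more VRAM is not available.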
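The tokens-to-pages arithmetic in the comments can be checked with a quick awk snippet; 1.2 tokens per word and 500 words per page are the heuristics the comments already use, not vLLM values.

```shell
# Rough capacity estimate: KV-cache token budget -> words -> single-spaced pages.
awk 'BEGIN {
  tokens = 7536              # KV cache capacity reported by the engine
  words  = tokens / 1.2      # heuristic: ~1.2 tokens per English word
  pages  = words / 500       # heuristic: ~500 words per single-spaced page
  printf "%.0f words, %.2f pages\n", words, pages
}'
# → 6280 words, 12.56 pages
```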