vllm-inference / README.md

yusufs

feat(add-model): always download model during build, it will be cached in the consecutive builds

8679a35 4 months ago

	---
	title: Deploy VLLM
	emoji: 🐢
	colorFrom: blue
	colorTo: blue
	sdk: docker
	pinned: false
	---


	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


	```shell
	poetry export -f requirements.txt --output requirements.txt --without-hashes
	```

	* Both `HUGGING_FACE_HUB_TOKEN` and `HF_TOKEN` must be set at runtime. Use the same value for both; the token must have read permission for the model.
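
The token requirement above could be checked at startup with something like the following. This is a minimal sketch: the helper name `check_hf_tokens` is hypothetical and not part of this Space or of vLLM; only the two variable names come from this README.

```python
import os

# Variable names taken from the README; everything else here is illustrative.
REQUIRED_TOKEN_VARS = ("HUGGING_FACE_HUB_TOKEN", "HF_TOKEN")


def check_hf_tokens(env=None):
    """Return the shared token, or raise if a variable is missing or the values differ."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_TOKEN_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    values = {env[name] for name in REQUIRED_TOKEN_VARS}
    if len(values) != 1:
        raise RuntimeError("HUGGING_FACE_HUB_TOKEN and HF_TOKEN must hold the same value")
    return values.pop()
```

Calling `check_hf_tokens()` before starting the server would fail fast on a misconfigured Space instead of failing later during model download. It does not verify that the token actually has read permission.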

	## VLLM OpenAI Compatible API Server

	> References: https://huggingface.co/spaces/sofianhw/ai/tree/c6527a750644a849b6705bb6fe2fcea4e54a8196

	This `api_server.py` file is copied from https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/entrypoints/openai/api_server.py, with the changes listed below.

	Changes (use a diff tool to see the exact differences):

	* [x] Change every route in `api_server.py` that starts with `/v1/xxx` to `/api/v1/xxx`, then run `python api_server.py` with arguments. See https://discuss.huggingface.co/t/run-vllm-docker-on-space/70228/5?u=yusufs
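
To illustrate the route change, here is a small sketch of the path mapping. The helper `to_space_route` is hypothetical and is not in `api_server.py`, where the route strings are edited directly. It only shows which client-side paths change.

```python
API_PREFIX = "/api"  # prefix added in this Space, per the change described above


def to_space_route(path: str) -> str:
    """Map an upstream route such as /v1/models to /api/v1/models."""
    if path == "/v1" or path.startswith("/v1/"):
        return API_PREFIX + path
    # Assumption: routes outside /v1 are left untouched.
    return path
```

A client that normally calls `/v1/chat/completions` would call `/api/v1/chat/completions` on this Space instead.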


	## Documentation about config

	* https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/utils.py#L1207-L1221

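	```python
	# Example argument list from the linked docstring in vllm/utils.py.
	# "serve,chat,complete" stands for the chosen subcommand.
	args = [
	    "serve,chat,complete",
	    "facebook/opt-12B",
	    "--config", "config.yaml",
	    "-tp", "2",
	]
	```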

	The YAML config is equivalent to passing command-line flags. Consider passing the flags defined here instead, since they are better documented:
	https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/entrypoints/openai/cli_args.py#L77-L237
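
The equivalence between a YAML config and flags can be sketched as follows. This is a simplified illustration, not vLLM's implementation: the function names are hypothetical, the config is passed as an already-parsed dict instead of being read from a file, and the example keys `port` and `tensor-parallel-size` are assumed sample settings.

```python
def flatten_config(config: dict) -> list:
    """Turn {"port": 12323} into ["--port", "12323"]."""
    flat = []
    for key, value in config.items():
        flat.extend(["--" + key, str(value)])
    return flat


def expand_config_flag(args: list, config: dict) -> list:
    """Replace '--config <file>' in args with the flattened config entries."""
    if "--config" not in args:
        return list(args)
    i = args.index("--config")
    # Skip both the flag and its file-name argument.
    return args[:i] + flatten_config(config) + args[i + 2:]


args = ["serve", "facebook/opt-12B", "--config", "config.yaml", "-tp", "2"]
config = {"port": 12323, "tensor-parallel-size": 4}  # stand-in for the parsed YAML
expanded = expand_config_flag(args, config)
```

Here `expanded` holds the same flags one would otherwise type by hand, which is why passing them directly loses nothing and keeps the launch command self-documenting.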

	Other arguments are the same as those of the `LLM` class, such as `--max-model-len`, `--dtype`, or `--otlp-traces-endpoint`:
	* https://github.com/vllm-project/vllm/blob/v0.6.4/vllm/config.py#L1061-L1086
	* https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/engine/arg_utils.py#L221-L913