Serve with vLLM
Has anyone been able to serve the model with vLLM?
In my testing, vLLM refreshes the processor on every request, even though it is supposed to be cached. The service then goes down with HTTP Error 429 caused by the repeated HEAD requests. I have no idea what breaks the LRU cache.
After debugging, it seems that get_processor always raises AttributeError('Qwen2TokenizerFast has no attribute start_image_token').
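For reference, this is roughly how I am hitting the vLLM OpenAI-compatible server; it is only a sketch, and the checkpoint name, port, and image URL are placeholders for my actual setup (the server is assumed to have been started with `vllm serve` beforehand):

```python
# Minimal sketch of a client request against a vLLM OpenAI-compatible server.
# Assumes the server was started separately, e.g.:
#   vllm serve <model-checkpoint> --port 8000
# The model name, port, and image URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="<model-checkpoint>",  # placeholder checkpoint name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Every such request triggers a fresh processor load instead of hitting the cache.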
Thank you for your interest in our work.
vLLM does support the GitHub-format InternVL. However, the error you encountered seems to come from the preprocessor assuming the model is in HuggingFace format. I suggest trying an earlier version of vLLM (e.g., 0.8.5.post1 for Qwen3 or 0.10.1 for GPT-OSS), or using our HF-format checkpoint. If the issue persists, we recommend deploying with LMDeploy.
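As a rough illustration (not a tested configuration), deploying the HF-format checkpoint with LMDeploy's pipeline API looks something like the sketch below; the checkpoint name and image URL are placeholders:

```python
# Rough sketch of running the HF-format checkpoint with LMDeploy's pipeline API.
# Install first: pip install lmdeploy
# The checkpoint name and image URL are placeholders.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("<hf-format-checkpoint>")  # placeholder HF-format checkpoint

image = load_image("https://example.com/sample.png")
response = pipe(("Describe this image.", image))
print(response.text)
```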