🚀 Build a Qwen 2.5 VL API endpoint with Hugging Face Spaces and Docker!
Vision-language models are making waves, but no provider yet offers an API-ready deployment of Qwen2.5-VL. In this guide, we’ll build a proof-of-concept API that hosts Qwen2.5-VL-3B-Instruct on Hugging Face Spaces using Docker. Let’s get hands-on and deploy a model that understands images and text in a single API call!
📌 What you’ll get: A live API that takes an image URL and text prompt, processes them with Qwen2.5-VL, and returns a response.
1️⃣ Set Up Your Space
Head over to Hugging Face Spaces and create a new Space, choosing Docker as the SDK. Make sure you attach a GPU to the Space for faster inference!
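If you prefer to script this step, the Space can also be created with the huggingface_hub library. The snippet below is a minimal sketch, assuming your HF token is configured and that the t4-small GPU tier is available on your account (adjust the hardware name to whatever you actually pick):

# Sketch: create a Docker Space with GPU hardware (assumes huggingface_hub is installed and HF_TOKEN is set)
from huggingface_hub import create_repo

create_repo(
    repo_id="<uname>/<spacename>",   # replace with your username and Space name
    repo_type="space",
    space_sdk="docker",              # the SDK we chose above
    space_hardware="t4-small",       # assumed GPU tier; any GPU hardware works
    exist_ok=True,
)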
2️⃣ Write the FastAPI Server
We’ll expose an endpoint that accepts an image_url and a prompt, runs inference with Qwen2.5-VL, and returns a text response.
📜 main.py
from fastapi import FastAPI, Query
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

app = FastAPI()

checkpoint = "Qwen/Qwen2.5-VL-3B-Instruct"

# Bound the image resolution the processor will use, which controls how many
# visual tokens each image produces.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    checkpoint,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # attn_implementation="flash_attention_2",
)

@app.get("/")
def read_root():
    return {"message": "API is live. Use the /predict endpoint."}

@app.get("/predict")
def predict(image_url: str = Query(...), prompt: str = Query(...)):
    # Build the chat-style message list expected by the Qwen2.5-VL processor.
    messages = [
        {"role": "system", "content": "You are a helpful assistant with vision abilities."},
        {"role": "user", "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_texts = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return {"response": output_texts[0]}
🔹 Endpoint: GET /predict?image_url=<URL>&prompt=<TEXT>
🔹 Returns: A generated response based on the image and prompt
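A GET endpoint is convenient because you can test it straight from the browser, but very long prompts can hit URL length limits. If you want a request body instead, here is a minimal sketch of an equivalent POST route; PredictRequest is a name introduced here for illustration, and it simply reuses the predict function defined above:

from pydantic import BaseModel

class PredictRequest(BaseModel):
    image_url: str
    prompt: str

@app.post("/predict")
def predict_post(req: PredictRequest):
    # Delegate to the existing GET handler so both routes share one code path
    return predict(image_url=req.image_url, prompt=req.prompt)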
3️⃣ Build the Dockerfile
We’ll install dependencies directly in the Dockerfile, eliminating the need for a requirements.txt.
📜 Dockerfile
# Use Python 3.12 as the base image
FROM python:3.12

# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    git \
    && rm -rf /var/lib/apt/lists/*

# Create a non-root user
RUN useradd -m -u 1000 user

WORKDIR /app

# Install Python dependencies directly
# (transformers is installed from source to make sure Qwen2.5-VL support is included)
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir \
    torch \
    torchvision \
    git+https://github.com/huggingface/transformers \
    accelerate \
    qwen-vl-utils[decord]==0.0.8 \
    fastapi \
    uvicorn[standard]

# Copy application files
COPY --chown=user . /app

# Switch to the non-root user
USER user

# Set environment variables
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
Once the build finishes and the Space status switches to Running, the API is live at https://<uname>-<spacename>.hf.space.
4️⃣ Test the API
✅ Using curl
curl -G "https://<uname>-<spacename>.hf.space/predict" \
--data-urlencode "image_url=https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" \
--data-urlencode "prompt=Describe this image."
✅ Using Python
import requests

url = "https://<uname>-<spacename>.hf.space/predict"

# Define the query parameters
params = {
    "image_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "prompt": "Describe this image.",
}

# Send the GET request
response = requests.get(url, params=params)

if response.status_code == 200:
    print("Response:", response.json())
else:
    print("Error:", response.status_code, response.text)
🎯 Final Thoughts
In just a few steps, we deployed Qwen2.5-VL-3B-Instruct as an API on Hugging Face Spaces with FastAPI & Docker.
🔹 Next Steps? Feel free to fork this Space and try out new things!
- Add a frontend to interact with the API (see the sketch below)!
- Optimize model inference for faster responses.
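For the frontend idea, a tiny Gradio app is enough to put a UI in front of the endpoint. This is only a sketch: it assumes the requests and gradio packages are installed, and you’ll need to fill in your own Space URL.

# Sketch: a minimal Gradio frontend that calls the deployed /predict endpoint
import gradio as gr
import requests

API_URL = "https://<uname>-<spacename>.hf.space/predict"  # replace with your Space URL

def describe(image_url: str, prompt: str) -> str:
    resp = requests.get(API_URL, params={"image_url": image_url, "prompt": prompt}, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]

gr.Interface(
    fn=describe,
    inputs=[gr.Textbox(label="Image URL"), gr.Textbox(label="Prompt")],
    outputs=gr.Textbox(label="Model response"),
    title="Qwen2.5-VL demo",
).launch()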