# Using pipelines for a webserver

<Tip>

Creating an inference engine is a complex topic, and the "best" solution will most likely depend on your problem space. Are you on CPU or GPU? Do you want the lowest latency, the highest throughput, support for many models, or just to highly optimize one specific model? There are many ways to tackle this topic, so what we are going to present is a good default to get started, which may not necessarily be the most optimal solution for you.

</Tip>

The key thing to understand is that we can use an iterator, just like you would [on a dataset](pipeline_tutorial#using-pipelines-on-a-dataset), since a webserver is basically a system that waits for requests and treats them as they come in.

Usually webservers are multiplexed (multithreaded, async, etc.) to handle various requests concurrently. Pipelines, on the other hand (and mostly the underlying models), are not really great for parallelism: they take up a lot of RAM, and inference is a compute-intensive job, so it's best to give them all the available resources when they are running.

We are going to solve that by having the webserver handle the light load of receiving and sending requests, and having a single thread handle the actual work. This example is going to use `starlette`. The actual framework is not really important, but you might have to tune or change the code if you are using another one to achieve the same effect.

Create `server.py`:

```py
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route
from transformers import pipeline
import asyncio


async def homepage(request):
    payload = await request.body()
    string = payload.decode("utf-8")
    response_q = asyncio.Queue()
    # Hand the input over to the single inference loop and wait for the result.
    await request.app.model_queue.put((string, response_q))
    output = await response_q.get()
    return JSONResponse(output)


async def server_loop(q):
    # The model is loaded once, in the single task that does all the inference.
    pipe = pipeline(model="bert-base-uncased")
    while True:
        (string, response_q) = await q.get()
        out = pipe(string)
        await response_q.put(out)


app = Starlette(
    routes=[
        Route("/", homepage, methods=["POST"]),
    ],
)


@app.on_event("startup")
async def startup_event():
    q = asyncio.Queue()
    app.model_queue = q
    asyncio.create_task(server_loop(q))
```

Now you can start it with:

```bash
uvicorn server:app
```

And you can query it:

```bash
curl -X POST -d "test [MASK]" http://localhost:8000/
#[{"score":0.7742936015129089,"token":1012,"token_str":".","sequence":"test."},...]
```

And there you go, now you have a good idea of how to create a webserver!

What is really important is that we load the model only **once**, so there are no copies of the model on the webserver. This way, no unnecessary RAM is being used. The queuing mechanism then allows you to do fancy things like accumulating a few items before running inference, to use dynamic batching:

```py
(string, rq) = await q.get()
strings = [string]
queues = [rq]
while True:
    try:
        # Wait up to 1 ms for the next item before running the batch.
        (string, rq) = await asyncio.wait_for(q.get(), timeout=0.001)  # 1ms
    except asyncio.TimeoutError:
        break
    strings.append(string)
    queues.append(rq)
outs = pipe(strings, batch_size=len(strings))
for rq, out in zip(queues, outs):
    await rq.put(out)
```

<Tip warning={true}>

Do not activate this without checking it makes sense for your load!

</Tip>

The proposed code is optimized for readability, not for being the best code. First of all, there's no batch size limit, which is usually not a great idea. Next, the timeout is reset on every queue fetch, meaning you could wait much more than 1ms before running the inference (delaying the first request by that much). It would be better to have a single 1ms deadline.

The deadline approach will always wait for 1ms even if the queue is empty, which might not be the best since you probably want to start doing inference if there's nothing left in the queue. But maybe it does make sense if batching is really crucial for your use case. Again, there's really no one best solution.
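
For illustration, here is a minimal sketch of what those two fixes could look like, reusing `pipe` and `q` from the `server_loop` above. `MAX_BATCH_SIZE` is a made-up constant, not something from the library; tune it for your model and hardware.

```py
MAX_BATCH_SIZE = 8  # made-up cap, tune for your model and hardware


async def server_loop(q):
    pipe = pipeline(model="bert-base-uncased")
    loop = asyncio.get_running_loop()
    while True:
        (string, rq) = await q.get()
        strings = [string]
        queues = [rq]
        # One shared 1ms deadline for the whole batch, instead of resetting
        # the timeout on every fetch.
        deadline = loop.time() + 0.001
        while len(strings) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                (string, rq) = await asyncio.wait_for(q.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            strings.append(string)
            queues.append(rq)
        outs = pipe(strings, batch_size=len(strings))
        for rq, out in zip(queues, outs):
            await rq.put(out)
```
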
## A few things you might want to consider

### Error checking

There's a lot that can go wrong in production: out of memory, out of disk space, loading the model might fail, the query might be wrong, the query might be correct but still fail to run because of a model misconfiguration, and so on.

Generally, it's good if the server surfaces the errors to the user, so adding a lot of `try..except` statements to return those errors is a good idea. But keep in mind it may also be a security risk to reveal all those errors, depending on your security context.
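
As one possible sketch (not the only way to do it), you could wrap the inference call in the `server_loop` from above so that a failure is reported back to the client instead of killing the loop:

```py
async def server_loop(q):
    pipe = pipeline(model="bert-base-uncased")
    while True:
        (string, response_q) = await q.get()
        try:
            out = pipe(string)
        except Exception as e:
            # Think about how much detail you actually want to expose to clients here.
            out = {"error": str(e)}
        await response_q.put(out)
```
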
### Circuit breaking

Webservers usually look better when they do circuit breaking: they return proper errors when they're overloaded, instead of just waiting for the query indefinitely. It's better to return a 503 right away than to make the client wait a very long time and then receive a 504.

This is relatively easy to implement in the proposed code since there is a single queue: looking at the queue size is a basic way to start returning errors before your webserver fails under load.
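
Here is a minimal sketch of that idea, reusing the `homepage` handler from above; `MAX_QUEUE_SIZE` is a made-up threshold you would tune for your load:

```py
MAX_QUEUE_SIZE = 32  # made-up threshold, tune for your load


async def homepage(request):
    payload = await request.body()
    string = payload.decode("utf-8")
    if request.app.model_queue.qsize() >= MAX_QUEUE_SIZE:
        # Shed load early instead of letting clients wait indefinitely.
        return JSONResponse({"error": "Server overloaded"}, status_code=503)
    response_q = asyncio.Queue()
    await request.app.model_queue.put((string, response_q))
    output = await response_q.get()
    return JSONResponse(output)
```
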
### Blocking the main thread

Currently PyTorch is not async aware, so the computation will block the main thread while it is running. That means it would be better if PyTorch were forced to run on its own thread/process. This wasn't done here because the code is a lot more complex (mostly because threads, async, and queues don't play nice together), but ultimately it does the same thing.

This matters if the inference of a single item is long (> 1s), because in that case every query made during inference would have to wait 1s before even receiving an error.
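
If you do want to go that route, one simple sketch is to push the blocking call onto a worker thread with `run_in_executor`, keeping the rest of the loop unchanged:

```py
async def server_loop(q):
    pipe = pipeline(model="bert-base-uncased")
    loop = asyncio.get_running_loop()
    while True:
        (string, response_q) = await q.get()
        # run_in_executor(None, ...) uses the default thread pool, so the
        # event loop stays free to accept requests while the model runs.
        out = await loop.run_in_executor(None, pipe, string)
        await response_q.put(out)
```
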
### Dynamic batching

In general, batching is not necessarily an improvement over passing 1 item at a time (see [batching details](./main_classes/pipelines#pipeline-batching) for more information). But it can be very effective when used in the correct setting. In the API, there is no dynamic batching by default (too much opportunity for a slowdown). But for BLOOM inference, which is a very large model, dynamic batching is **essential** to provide a decent experience for everyone.