
Backend worker did not respond in given time

#1
by payamahv - opened

I encountered an issue while attempting to segment multiple images in a loop. The error I am receiving is:

"Backend worker did not respond in the given time."

org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

It appears that GPU memory is not being released properly: the available GPU memory shrinks with each image processed.
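For context, this is roughly how the per-image growth can be observed; `mask_generator` and `images` are placeholder names here, not the actual handler code:

```python
import torch

# Placeholder loop: print allocator stats after each image so the growth is visible.
for i, image in enumerate(images):
    masks = mask_generator.generate(image)
    print(
        f"image {i}: "
        f"allocated={torch.cuda.memory_allocated() / 1e6:.1f} MB, "
        f"reserved={torch.cuda.memory_reserved() / 1e6:.1f} MB"
    )
```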

2025-01-15T18:16:06,223 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2025-01-15T18:16:06,223 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:247) [model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]

I have already tried all of the following and am still getting the error; a sketch combining them is shown after the list.

Explicitly clear GPU memory: Use torch.cuda.empty_cache() to clear the GPU memory after processing each image.

Delete variables: Ensure that you delete any variables holding large tensors after they are no longer needed using del.

Use with torch.no_grad(): Wrap your inference code with torch.no_grad() to prevent PyTorch from storing intermediate values for backpropagation, which can save memory.
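Putting those three steps together, this is a minimal sketch of the loop I would expect to release memory on every iteration; `mask_generator`, `images`, and `save_masks` are placeholder names, and the model is assumed to be loaded once outside the loop:

```python
import gc
import torch

with torch.no_grad():                  # no autograd graph is kept during inference
    for i, image in enumerate(images):
        masks = mask_generator.generate(image)
        save_masks(i, masks)           # persist results instead of accumulating them across iterations
        del masks                      # drop references to the per-image outputs
        gc.collect()                   # let Python release the objects first
        torch.cuda.empty_cache()       # then return unused cached blocks to the driver
```

Note that torch.cuda.empty_cache() only releases cached blocks that no tensor still references, so if memory keeps growing even with this pattern, something else (for example a list that keeps GPU tensors alive across iterations, or state cached inside the handler) must still be holding references.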
