Sanan bin Tahir and Kevin Hu committed
Commit 7bc5b54 · 1 Parent(s): 480ea1f

Change launch backend script to handle errors gracefully (#3334)

### What problem does this PR solve?

The `launch_backend_service.sh` script runs both the task executors and
the backend server in infinite loops. When any of these processes fails,
the script restarts it immediately and indefinitely, and it never
handles termination signals. As a result, the script effectively ignores
interrupts, keeps printing the same error messages, and is difficult to
stop gracefully.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Explanation of Modifications

1. **Signal Trapping with `trap`:**
   - The `trap cleanup SIGINT SIGTERM` line ensures that the `cleanup`
     function is invoked when a `SIGINT` or `SIGTERM` signal is received.
   - The `cleanup` function sets the `STOP` flag to `true`, iterates
     through all child process IDs stored in the `PIDS` array, and calls
     `kill` on each one (which sends `SIGTERM` by default) to terminate
     it gracefully.
2. **Retry Limits:**
   - Introduced a `MAX_RETRIES` variable to limit the number of restart
     attempts for both `task_executor.py` and `ragflow_server.py`.
   - The loops now check whether the retry count has reached the maximum.
     If so, they invoke the `cleanup` function to terminate all processes
     and exit the script.
3. **Process Tracking with `PIDS` Array:**
   - After each background process (`task_exe` and `run_server`) is
     launched, its process ID (PID) is stored in the `PIDS` array.
   - This allows the `cleanup` function to terminate all child processes
     when needed.
4. **Graceful Shutdown:**
   - When the `cleanup` function is called, it iterates over all child
     PIDs and sends a termination signal to each with `kill`, so that
     subprocesses are asked to stop before the script exits.
5. **Logging Enhancements:**
   - Added `echo` statements to provide clearer logs about the state of
     each process, including attempts, successes, failures, and retries.
6. **Exit on Successful Completion:**
   - If `ragflow_server.py` or a `task_executor.py` process exits with
     code 0 (success), its retry loop ends, preventing unnecessary
     restarts.
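The retry limit in item 2 can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: `run_with_retries` is a hypothetical helper, and the delay between attempts is omitted for brevity. It also shows one way to read a command's exit status when `set -e` is enabled, since under `set -e` a failing command normally ends the shell before a following `EXIT_CODE=$?` line can run.

```shell
#!/bin/bash
set -e

# Hypothetical helper (not part of this PR): run a command up to
# max_retries times. Returns 0 on the first success and 1 once every
# attempt has failed.
run_with_retries() {
    local max_retries=$1
    shift
    local attempt=0
    local exit_code=0
    while [ "$attempt" -lt "$max_retries" ]; do
        exit_code=0
        # "|| exit_code=$?" records the failure without tripping `set -e`,
        # so the loop can inspect the status instead of the shell exiting.
        "$@" || exit_code=$?
        if [ "$exit_code" -eq 0 ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        echo "Attempt $attempt of $max_retries failed with exit code $exit_code" >&2
    done
    return 1
}

# Example: a command that always fails is tried three times, then given up on.
run_with_retries 3 false || echo "Giving up after 3 attempts"
```

Returning a status from the helper, rather than calling a cleanup routine inside it, leaves the caller to decide what to do once the retries are exhausted.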

Co-authored-by: Kevin Hu <[email protected]>

Files changed (1)

docker/launch_backend_service.sh (+84 -9)
@@ -1,28 +1,103 @@
 #!/bin/bash
 
-# unset http proxy which maybe set by docker daemon
+# Exit immediately if a command exits with a non-zero status
+set -e
+
+# Unset HTTP proxies that might be set by Docker daemon
 export http_proxy=""; export https_proxy=""; export no_proxy=""; export HTTP_PROXY=""; export HTTPS_PROXY=""; export NO_PROXY=""
 
 export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/
 
 PY=python3
+
+# Set default number of workers if WS is not set or less than 1
 if [[ -z "$WS" || $WS -lt 1 ]]; then
   WS=1
 fi
 
-function task_exe(){
-    while [ 1 -eq 1 ];do
-      $PY rag/svr/task_executor.py $1;
+# Maximum number of retries for each task executor and server
+MAX_RETRIES=5
+
+# Flag to control termination
+STOP=false
+
+# Array to keep track of child PIDs
+PIDS=()
+
+# Function to handle termination signals
+cleanup() {
+    echo "Termination signal received. Shutting down..."
+    STOP=true
+    # Terminate all child processes
+    for pid in "${PIDS[@]}"; do
+        if kill -0 "$pid" 2>/dev/null; then
+            echo "Killing process $pid"
+            kill "$pid"
+        fi
+    done
+    exit 0
+}
+
+# Trap SIGINT and SIGTERM to invoke cleanup
+trap cleanup SIGINT SIGTERM
+
+# Function to execute task_executor with retry logic
+task_exe(){
+    local task_id=$1
+    local retry_count=0
+    while ! $STOP && [ $retry_count -lt $MAX_RETRIES ]; do
+        echo "Starting task_executor.py for task $task_id (Attempt $((retry_count+1)))"
+        $PY rag/svr/task_executor.py "$task_id"
+        EXIT_CODE=$?
+        if [ $EXIT_CODE -eq 0 ]; then
+            echo "task_executor.py for task $task_id exited successfully."
+            break
+        else
+            echo "task_executor.py for task $task_id failed with exit code $EXIT_CODE. Retrying..." >&2
+            retry_count=$((retry_count + 1))
+            sleep 2
+        fi
     done
+
+    if [ $retry_count -ge $MAX_RETRIES ]; then
+        echo "task_executor.py for task $task_id failed after $MAX_RETRIES attempts. Exiting..." >&2
+        cleanup
+    fi
 }
 
+# Function to execute ragflow_server with retry logic
+run_server(){
+    local retry_count=0
+    while ! $STOP && [ $retry_count -lt $MAX_RETRIES ]; do
+        echo "Starting ragflow_server.py (Attempt $((retry_count+1)))"
+        $PY api/ragflow_server.py
+        EXIT_CODE=$?
+        if [ $EXIT_CODE -eq 0 ]; then
+            echo "ragflow_server.py exited successfully."
+            break
+        else
+            echo "ragflow_server.py failed with exit code $EXIT_CODE. Retrying..." >&2
+            retry_count=$((retry_count + 1))
+            sleep 2
+        fi
+    done
+
+    if [ $retry_count -ge $MAX_RETRIES ]; then
+        echo "ragflow_server.py failed after $MAX_RETRIES attempts. Exiting..." >&2
+        cleanup
+    fi
+}
+
+# Start task executors
 for ((i=0;i<WS;i++))
 do
-  task_exe $i &
+  task_exe "$i" &
+  PIDS+=($!)
 done
 
-while [ 1 -eq 1 ];do
-    $PY api/ragflow_server.py
-done
+# Start the main server
+run_server &
+PIDS+=($!)
 
-wait;
+# Wait for all background processes to finish
+wait
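
The process-tracking and shutdown pattern used in this diff can also be sketched on its own. The sketch below makes assumptions that go beyond the PR: `start_worker` and `stop_children` are hypothetical helpers, and the commands are launched directly rather than through a shell function. In the script above, `$!` is the PID of the subshell running `task_exe` or `run_server`, so it may be worth confirming that the Python processes themselves also stop when those subshells are killed.

```shell
#!/bin/bash

# PIDs of every background process started through start_worker
PIDS=()

# Hypothetical helper: run the given command in the background and record
# its PID. Because the command is launched directly, $! is the PID of the
# command itself rather than of a wrapping subshell.
start_worker() {
    "$@" &
    PIDS+=("$!")
}

# Send SIGTERM (the default signal of `kill`) to every tracked process that
# is still alive, then wait for each one so none is left running on exit.
stop_children() {
    local pid
    for pid in "${PIDS[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "Stopping process $pid"
            kill "$pid" 2>/dev/null || true
        fi
    done
    for pid in "${PIDS[@]}"; do
        wait "$pid" 2>/dev/null || true
    done
}

# On Ctrl-C or SIGTERM, stop the children first and only then exit.
trap 'stop_children; exit 0' INT TERM
```

A launcher built on this sketch would call `start_worker` once per process and then `wait`. In bash, `wait` returns as soon as a trapped signal arrives, which lets the handler run promptly.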