Fix child processes not being reaped when Process.detach used (puma…

…#3314) * Fix child processes not being reaped when `Process.detach` used Starting with Puma v6.4.1, we observed that killed Puma cluster workers were never being restarted when the parent was run as PID 1. For example, I issued a `kill 44` and PID 44 remained in the `defunct` state: ``` git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef UID PID PPID C STIME TTY TIME CMD git 1 0 0 Jan09 ? 00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker] git 23 1 0 Jan09 ? 00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab git 41 1 0 Jan09 ? 00:01:55 ruby /srv/gitlab/bin/metrics-server git 44 1 0 Jan09 ? 00:02:41 [ruby] <defunct> git 46 1 0 Jan09 ? 00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker] git 48 1 0 Jan09 ? 00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker] git 49 1 0 Jan09 ? 00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker] git 5205 0 0 21:57 pts/0 00:00:00 bash git 5331 5205 0 22:00 pts/0 00:00:00 ps -ef ``` Further investigation showed that the introduction of `Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return anything when `Process.detach` is run on some process that has not exited. This bug appears to be present from Ruby 2.6 to 3.2, but has been been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837 Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each known worker PID. puma#3255 changed this behavior to do this only if the `fork_worker` config parameter were enabled, but it seems that we should always do this to ensure that terminated workers are reaped in a timely manner. Closes puma#3313 * Add integration test for Puma worker reaping This test ensures that Puma handles the `Process.detach` bug described in https://bugs.ruby-lang.org/issues/19837.
dentarg · Jan 26, 2024 · 9bd838b · 9bd838b
1 parent f6ab4e9
commit 9bd838b
Show file tree

Hide file tree

Showing 3 changed files with 47 additions and 5 deletions.
diff --git a/lib/puma/cluster.rb b/lib/puma/cluster.rb
@@ -563,9 +563,12 @@ def wait_workers
       @workers.reject! do |w|
         next false if w.pid.nil?
         begin
-          # When `fork_worker` is enabled, some worker may not be direct children, but grand children.
-          # Because of this they won't be reaped by `Process.wait2(-1)`, so we need to check them individually)
-          if reaped_children.delete(w.pid) || (@options[:fork_worker] && Process.wait(w.pid, Process::WNOHANG))
+          # We may need to check the PID individually because:
+          # 1. From Ruby versions 2.6 to 3.2, `Process.detach` can prevent or delay
+          #    `Process.wait2(-1)` from detecting a terminated process: https://bugs.ruby-lang.org/issues/19837.
+          # 2. When `fork_worker` is enabled, some worker may not be direct children,
+          #    but grand children.  Because of this they won't be reaped by `Process.wait2(-1)`.
+          if reaped_children.delete(w.pid) || Process.wait(w.pid, Process::WNOHANG)
             true
           else
             w.term if w.term?

diff --git a/test/config/process_detach_before_fork.rb b/test/config/process_detach_before_fork.rb
@@ -0,0 +1,11 @@
+worker_shutdown_timeout 0
+
+before_fork do
+  pid = fork do
+    sleep 30 # This has to exceed the test timeout
+  end
+
+  pid_filename = File.join(Dir.tmpdir, 'process_detach_test.pid')
+  File.write(pid_filename, pid)
+  Process.detach(pid)
+end
diff --git a/test/test_integration_cluster.rb b/test/test_integration_cluster.rb
@@ -175,6 +175,34 @@ def test_stuck_external_term_spawn
     end
   end
 
+  # From Ruby 2.6 to 3.2, `Process.detach` can delay or prevent
+  # `Process.wait2(-1)` from detecting a terminated child:
+  # https://bugs.ruby-lang.org/issues/19837. However,
+  # `Process.wait2(<child pid>)` still works properly. This bug has
+  # been fixed in Ruby 3.3.
+  def test_workers_respawn_with_process_detach
+    skip_unless_signal_exist? :KILL
+
+    config = 'test/config/process_detach_before_fork.rb'
+
+    worker_respawn(0, workers, config) do |phase0_worker_pids|
+      phase0_worker_pids.each do |pid|
+        Process.kill :KILL, pid
+      end
+    end
+
+    # `test/config/process_detach_before_fork.rb` forks and detaches a
+    # process.  Since MiniTest attempts to join all threads before
+    # finishing, terminate the process so that the test can end quickly
+    # if it passes.
+    pid_filename = File.join(Dir.tmpdir, 'process_detach_test.pid')
+    if File.exist?(pid_filename)
+      pid = File.read(pid_filename).chomp.to_i
+      File.unlink(pid_filename)
+      Process.kill :TERM, pid if pid > 0
+    end
+  end
+
   # mimicking stuck workers, test restart
   def test_stuck_phased_restart
     skip_unless_signal_exist? :USR1
@@ -681,10 +709,10 @@ def usr1_all_respond(unix: false, config: '')
     end
   end
 
-  def worker_respawn(phase = 1, size = workers)
+  def worker_respawn(phase = 1, size = workers, config = 'test/config/worker_shutdown_timeout_2.rb')
     threads = []
 
-    cli_server "-w #{workers} -t 1:1 -C test/config/worker_shutdown_timeout_2.rb test/rackup/sleep_pid.ru"
+    cli_server "-w #{workers} -t 1:1 -C #{config} test/rackup/sleep_pid.ru"
 
     # make sure two workers have booted
     phase0_worker_pids = get_worker_pids