Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rb] Handle graceful webdriver shutdown #14430

Merged
merged 4 commits into from
Nov 15, 2024

Conversation

josegomezr
Copy link
Contributor

@josegomezr josegomezr commented Aug 23, 2024

User description

Improve the shutdown sequence by separating a dead child pid (not running anymore) from a running child pid.

Description

When shutting down Selenium::WebDriver::ServiceManager#stop_server issues a /shutdown request against the webdriver server and the server exits cleanly, however the mechanism to assert if the child process is up or not cannot distinguish between busy or not found.

Selenium::WebDriver::ChildProcess#exited? returns false when the process is still alive because
Selenium::WebDriver::ChildProcess#waitpid2 returns nil.

However, by catching Errno::ECHILD and Errno::ESRCH and returning nil #waitpid2 masks a not running pid as "running" delaying the shutdown procedure when the server was already gone.

This patch moves the exception handling away from the Process wrappers so its closer to the decision points in
Selenium::WebDriver::ChildProcess#exited? and
Selenium::WebDriver::ChildProcess#stop.

Addresses a similar problem as in #14032. With this patch the shutdown sequence looks like:

  1. Send a /shutdown signal to the server [if supported]
    • Check for the server pid
      • If it cannot be found (Errno::ECHILD, Errno::ESRCH), then assume the server has shutdown.
      • else: wait and retry in ~100 ms.
    • repeat for 20 seconds (Selenium::WebDriver::ServiceManager::STOP_TIMEOUT).
  2. Try to send a TERM signal to the server.
    • Check for the server pid
      • If it cannot be found (Errno::ECHILD, Errno::ESRCH), then assume the server has shutdown.
      • else: wait and retry in ~100 ms.
    • repeat for 3 seconds (default value of timeout inSelenium::WebDriver::ChildProcess#stop ).
  3. Try to send a KILL signal to the server.
    • Check for the server pid
      • If it cannot be found (Errno::ECHILD, Errno::ESRCH), then assume the server has shutdown.
      • else: wait and retry in ~100 ms.

Motivation and Context

Noticed that the tests will end, but it'll be hung for 10~20 seconds doing nothing. This patch reduces my test run time from ~32s to ~13s consistently.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Logs of a production application

with selenium-webdrivers 4.23.0

time RAILS_ENV=test bex teaspoon
# [...]
2024-08-23 16:16:10 DEBUG Selenium [:process] Starting process: ["/usr/bin/chromedriver", "--port=9515"] with {[:out, :err]=>#<IO:<STDERR>>, :pgroup=>true} 
2024-08-23 16:16:10 DEBUG Selenium [:process]   -> pid: 27757 
# [...]
........................................................................................................................................

Finished in 1.78500 seconds
136 examples, 0 failures
2024-08-23 16:16:18 DEBUG Selenium [:command] -> DELETE session/60ed9bb863325bb8238f2cc86bee81c6 
2024-08-23 16:16:19 DEBUG Selenium [:header]    <<<  {"content-length"=>["14"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-08-23 16:16:19 DEBUG Selenium [:command] <- {"value":null} 
2024-08-23 16:16:19 DEBUG Selenium [:process] Checking if 27757 is exited: 
2024-08-23 16:16:19 DEBUG Selenium [:process] Polling 20 seconds for exit of 27757 
2024-08-23 16:16:19 DEBUG Selenium [:process] Checking if 27757 is exited: 
2024-08-23 16:16:19 DEBUG Selenium [:process] Checking if 27757 is exited: 


real    0m33.907s
user    0m6.563s
sys     0m1.045s

with this patch:

2024-08-23 16:17:55 DEBUG Selenium [:process] Starting process: ["/usr/bin/chromedriver", "--port=9515"] with {[:out, :err]=>#<IO:<STDERR>>, :pgroup=>true} 
2024-08-23 16:17:55 DEBUG Selenium [:process]   -> pid: 28825 

........................................................................................................................................

Finished in 1.54100 seconds
136 examples, 0 failures
2024-08-23 16:18:03 DEBUG Selenium [:command] -> DELETE session/93d5a7d278d25585438a62a60a0625ca 
2024-08-23 16:18:03 DEBUG Selenium [:header]    <<<  {"content-length"=>["14"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-08-23 16:18:03 DEBUG Selenium [:command] <- {"value":null} 
2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:driver_service] Sending shutdown request to server 
2024-08-23 16:18:03 DEBUG Selenium [:process] Polling 20 seconds for exit of 28825 
2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:process]   -> process: 28825 already finished 
2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:process]   -> process: 28825 already finished 
2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:process]   -> process: 28825 already finished 

2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:process]   -> process: 28825 already finished 
2024-08-23 16:18:03 DEBUG Selenium [:process] Checking if 28825 is exited: 
2024-08-23 16:18:03 DEBUG Selenium [:process]   -> process: 28825 already finished 

real    0m12.858s
user    0m6.254s
sys     0m1.049s

PR Type

enhancement, tests


Description

  • Refactored the process termination logic in ChildProcess to improve clarity and error handling.
  • Introduced a new method terminate_and_wait_else_kill to encapsulate termination logic.
  • Enhanced logging for process termination and server shutdown requests to aid in debugging.
  • Improved handling of Errno::ECHILD and Errno::ESRCH exceptions to correctly identify exited processes.

Changes walkthrough 📝

Relevant files
Enhancement
child_process.rb
Refactor and enhance process termination and exit handling

rb/lib/selenium/webdriver/common/child_process.rb

  • Refactored process termination logic into a new method
    terminate_and_wait_else_kill.
  • Added debug logging for process termination and exit checks.
  • Improved error handling for process exit checks with specific
    exceptions.
  • +20/-14 
    service_manager.rb
    Add debug logging for server shutdown requests                     

    rb/lib/selenium/webdriver/common/service_manager.rb

    • Added debug logging for server shutdown requests.
    +1/-0     

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    When shutting down `Selenium::WebDriver::ServiceManager#stop_server`
    issues a `/shutdown` request against the webdriver server and the
    server exits cleanly, however the mechanism to assert if the child
    process is up or not cannot distinguish between busy or not found.
    
    `Selenium::WebDriver::ChildProcess#exited?` returns `false` when the
    process is still alive because
    `Selenium::WebDriver::ChildProcess#waitpid2` returns `nil`.
    
    However, by catching `Errno::ECHILD` and `Errno::ESRCH` and returning
    `nil` `#waitpid2` masks a not running pid as "running" delaying the
    shutdown procedure when the server was already gone.
    
    This patch moves the exception handling away from the `Process` wrappers
    so its closer to the decision points in
    `Selenium::WebDriver::ChildProcess#exited?` and
    `Selenium::WebDriver::ChildProcess#stop`.
    
    Addresses a similar inconsistency in SeleniumHQ#14032
    @CLAassistant
    Copy link

    CLAassistant commented Aug 23, 2024

    CLA assistant check
    All committers have signed the CLA.

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Error Handling
    The new error handling in the stop method might mask other potential errors. Consider adding a more specific rescue clause or logging the error details.

    Code Duplication
    The terminate_and_wait_else_kill method contains logic that was previously in the stop method. Consider if this abstraction is necessary or if it adds unnecessary complexity.

    Copy link
    Contributor

    qodo-merge-pro bot commented Aug 23, 2024

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Error handling
    Add error handling for the server shutdown HTTP request

    In the stop_server method, consider adding error handling for the HTTP request. This
    can help manage potential network issues or server errors during shutdown.

    rb/lib/selenium/webdriver/common/service_manager.rb [109-115]

     def stop_server
       connect_to_server do |http|
         headers = WebDriver::Remote::Http::Common::DEFAULT_HEADERS.dup
         WebDriver.logger.debug("Sending shutdown request to server", id: :driver_service)
    -    http.get('/shutdown', headers)
    +    begin
    +      http.get('/shutdown', headers)
    +    rescue StandardError => e
    +      WebDriver.logger.warn("Error during server shutdown: #{e.message}", id: :driver_service)
    +    end
       end
     end
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Implementing error handling for the HTTP request in stop_server is a significant improvement as it addresses potential network issues or server errors, enhancing the robustness of the code.

    8
    Enhancement
    Enhance error logging by including the specific exception type

    In the exited? method, consider logging the specific exception type when catching
    Errno::ECHILD or Errno::ESRCH. This can provide more detailed debugging information.

    rb/lib/selenium/webdriver/common/child_process.rb [86-88]

    -rescue Errno::ECHILD, Errno::ESRCH
    -  WebDriver.logger.debug("  -> process: #{@pid} already finished", id: :process)
    +rescue Errno::ECHILD, Errno::ESRCH => e
    +  WebDriver.logger.debug("  -> process: #{@pid} already finished (#{e.class.name})", id: :process)
       return true
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Adding the specific exception type to the log message enhances debugging and provides more context, which is beneficial for maintenance and troubleshooting.

    7
    Improve method name to better reflect its functionality

    Consider using a more descriptive method name for terminate_and_wait_else_kill. A
    name like graceful_shutdown_with_fallback might better convey the method's behavior.

    rb/lib/selenium/webdriver/common/child_process.rb [105-116]

    -def terminate_and_wait_else_kill(timeout)
    +def graceful_shutdown_with_fallback(timeout)
       WebDriver.logger.debug("Sending TERM to process: #{@pid}", id: :process)
       terminate(@pid)
       poll_for_exit(timeout)
     
       WebDriver.logger.debug("  -> stopped #{@pid}", id: :process)
     rescue TimeoutError, Errno::EINVAL
       WebDriver.logger.debug("    -> sending KILL to process: #{@pid}", id: :process)
       kill(@pid)
       wait
       WebDriver.logger.debug("      -> killed #{@pid}", id: :process)
     end
     
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: The suggestion to rename the method to graceful_shutdown_with_fallback improves code readability and better describes the method's functionality, but it is not crucial for functionality.

    6

    @VietND96 VietND96 requested review from p0deje and aguspe November 15, 2024 03:40
    @p0deje
    Copy link
    Member

    p0deje commented Nov 15, 2024

    Sorry for taking too long on this one, can you please address RuboCop complaints https://github.com/SeleniumHQ/selenium/actions/runs/11849853755/job/33048613805?pr=14430 and we'll merge afterwards.

    @p0deje p0deje merged commit 7f4d4bf into SeleniumHQ:trunk Nov 15, 2024
    22 of 23 checks passed
    @p0deje
    Copy link
    Member

    p0deje commented Nov 15, 2024

    Thank you for the contribution!

    @josegomezr
    Copy link
    Contributor Author

    Thank y'all. Until the next bug feature! ✌️

    jkim2492 pushed a commit to jkim2492/selenium that referenced this pull request Nov 17, 2024
    When shutting down `Selenium::WebDriver::ServiceManager#stop_server`
    issues a `/shutdown` request against the webdriver server and the
    server exits cleanly, however the mechanism to assert if the child
    process is up or not cannot distinguish between busy or not found.
    
    `Selenium::WebDriver::ChildProcess#exited?` returns `false` when the
    process is still alive because
    `Selenium::WebDriver::ChildProcess#waitpid2` returns `nil`.
    
    However, by catching `Errno::ECHILD` and `Errno::ESRCH` and returning
    `nil` `#waitpid2` masks a not running pid as "running" delaying the
    shutdown procedure when the server was already gone.
    
    This patch moves the exception handling away from the `Process` wrappers
    so its closer to the decision points in
    `Selenium::WebDriver::ChildProcess#exited?` and
    `Selenium::WebDriver::ChildProcess#stop`.
    
    Addresses a similar inconsistency in SeleniumHQ#14032
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    4 participants