Feature request: graceful stop/restart #96147
Pinging @elastic/es-core-infra (Team:Core/Infra)
Early in shutdown, stop listening for HTTP requests and gracefully close all HTTP connections. Adds the `http.shutdown_grace_period` setting, the maximum amount of time to wait for in-flight HTTP requests to finish. After that time, the HTTP channels are all closed.

Graceful shutdown procedure:

1) Stop listening for new HTTP connections.
2) Tell all new requests to add a `Connection: close` response header and close the channel after the request.
3) Wait up to the grace period for all open connections to close.
4) If the grace period expired, close all remaining connections.

Fixes: #96147
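For anyone configuring this, a minimal sketch of how the new setting might be applied in `elasticsearch.yml`; the `10s` value is only an illustrative example, not a statement about the default:

```yaml
# elasticsearch.yml (sketch): give in-flight HTTP requests up to 10 seconds to
# finish before their connections are force-closed on shutdown.
http.shutdown_grace_period: 10s
```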
Hello, I have only two small questions:

Thank you again
@blacktek We have no plans to backport it to 7.17. The default value for
Hello,
Verified, and it's merged. Now https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-8.9.0.html is complete, under Infra/Node Lifecycle. Thank you!
Yeah, it's there. Thanks for the verification.
Hello,

Basically `systemctl restart elasticsearch` returned an exit code != 0, and the issue was with (output of `journalctl -u elasticsearch -f` as root):

    [2023-08-29T05:16:08,439][WARN ][o.e.h.AbstractHttpServerTransport] [ip-172-23-0-61] timed out while waiting [5000]ms for clients to close connections
    [2023-08-29T05:16:08,450][INFO ][o.e.n.Node ] [ip-172-23-0-61] stopping ...
    [2023-08-29T05:16:08,452][INFO ][o.e.x.w.WatcherService ] [ip-172-23-0-61] stopping watch service, reason [shutdown initiated]
    [2023-08-29T05:16:08,453][INFO ][o.e.x.w.WatcherLifeCycleService] [ip-172-23-0-61] watcher has stopped and shutdown
    [2023-08-29T05:16:08,491][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ip-172-23-0-61] [controller/3218034] [Main.cc@176] ML controller exiting
    [2023-08-29T05:16:08,498][INFO ][o.e.x.m.p.NativeController] [ip-172-23-0-61] Native controller process has stopped - no new native processes can be started
    [2023-08-29T05:16:08,579][INFO ][o.e.c.c.Coordinator ] [ip-172-23-0-61] master node [{ip-172-23-1-62}{Ie7dkFSjSMGLIQF27pKjlA}{Ayn4iNHmR6u1sYYy-z4vzA}{ip-172-23-1-62}{172.23.1.62}{172.23.1.62:9300}{cdfhilmrstw}{8.9.1}] disconnected, restarting discovery
    [2023-08-29T05:16:08,921][INFO ][o.e.n.Node ] [ip-172-23-0-61] stopped
    [2023-08-29T05:16:08,921][INFO ][o.e.n.Node ] [ip-172-23-0-61] closing ...
    [2023-08-29T05:16:08,957][INFO ][o.e.n.Node ] [ip-172-23-0-61] closed

Do you have any idea on what might have happened? It's the first time we see a restart error.

I have a side question too: do you consider keepalive connections to Elasticsearch as "active connections" waiting for their termination, or do you only look at connections with active queries? We have an nginx proxy forwarding requests to Elasticsearch, with a keepalive timeout of 60 seconds (now reduced to 15 seconds, with a grace period of 16 seconds, to see if it happens again).

Thank you!
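For context on the proxy side, a rough sketch of the kind of nginx-to-Elasticsearch keepalive setup described above; the directives are standard nginx, but the upstream address, pool size, and timeout values are illustrative assumptions rather than the actual configuration:

```nginx
# Sketch only: nginx reverse proxy in front of Elasticsearch with upstream keepalive.
# keepalive_timeout in the upstream block requires nginx >= 1.15.3.
upstream elasticsearch {
    server 172.23.0.61:9200;
    keepalive 16;            # maximum idle upstream connections cached per worker
    keepalive_timeout 15s;   # close idle upstream connections before the ES grace period elapses
}

server {
    listen 80;

    location / {
        proxy_pass http://elasticsearch;
        proxy_http_version 1.1;
        proxy_set_header Connection "";  # required so upstream connections can be reused
    }
}
```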
Hey @blacktek,

Graceful shutdown only waits for active requests. Idle connections are shut down. The expected procedure is:
Hi @stu-elastic,

This is surely possible: during the grace period (which is known only to Elasticsearch, not to the proxy) new requests should be rejected, according to the expected behaviour. Am I wrong?

This issue has happened only once so far. Should I open a ticket on https://discuss.elastic.co/ ? What else can I add?

Thanks
I wasn't accurate. There were outstanding requests that took more than 5 seconds to complete. ES stops accepting new connections as soon as it sees the SIGTERM and closes all idle connections as well.
This tells me the elasticsearch unit is not allowing the process to fully shut down and so is force-killing it. Please check the definition of the unit and make sure it matches the timeout settings being used.
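As a rough sketch of that check, assuming a standard systemd installation and the default unit name, the effective timeouts (including any drop-in overrides) can be inspected with:

```sh
# Show the unit file together with any drop-in overrides.
systemctl cat elasticsearch

# Print the timeout-related values systemd is actually applying.
systemctl show elasticsearch --property=TimeoutStopSec,TimeoutStartSec,KillSignal
```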
@stu-elastic Please look:

    # Disable timeout logic and wait until process is stopped
    TimeoutStopSec=0

    # Allow a slow startup before the systemd notifier module kicks in to extend the timeout
    TimeoutStartSec=900

With TimeoutStopSec=0 we should basically not have any stop timeout. Tomorrow we will change this setting too, but I think it will have no effect because that timeout is the time to wait before restarting a service. It seems that our configuration of the systemd unit is correct. Do you have any other idea?

Thank you
@blacktek
I'd like to have a feature in the Elasticsearch server that allows performing a graceful stop or restart.

Basically, after executing a systemctl stop/restart command, I would like the currently pending requests to be completed first (with a customizable timeout) while no new requests are accepted.

This would simplify draining in-flight requests, without forcing the application to retry connections that were abruptly closed.

You already have something similar with https://www.elastic.co/guide/en/cloud/current/ec-maintenance-mode-deployments.html, but it should work with the base Elasticsearch server too.
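As a minimal sketch of the workflow being asked for here, using the `http.shutdown_grace_period` setting discussed earlier in the thread (the value and the exact commands are illustrative, not recommendations):

```sh
# 1) In elasticsearch.yml, allow in-flight HTTP requests time to finish on shutdown:
#      http.shutdown_grace_period: 30s

# 2) Restart via systemd; during the grace period the node stops accepting new
#    connections and asks remaining clients to close via "Connection: close".
sudo systemctl restart elasticsearch

# 3) Follow the shutdown in the journal to confirm clients drained in time.
sudo journalctl -u elasticsearch -f
```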