Feature request: graceful stop/restart #96147

blacktek · 2023-05-16T10:53:35Z

I'd like to have a feature in the elasticsearch server that allows performing a graceful stop or restart.

Basically, after executing a systemctl stop/restart command, I would like the currently pending requests to be completed first (with a customizable timeout), while no new requests are accepted.

This would simplify the draining of in-flight requests, without forcing the application to retry connections that were abruptly closed.

You already have something similar with https://www.elastic.co/guide/en/cloud/current/ec-maintenance-mode-deployments.html - but it should work with base elasticsearch server too.

elasticsearchmachine · 2023-05-16T12:56:43Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

Early in shutdown, stop listening for HTTP requests and gracefully close all HTTP connections. Adds `http.shutdown_grace_period` setting, the maximum amount of time to wait for in-flight HTTP requests to finish. After that time, the http channels are all closed.

Graceful shutdown procedure:
1) Stop listening for new HTTP connections
2) Tell all new requests to add `Connection: close` response header and close the channel after the request.
3) Wait up to the grace period for all open connections to close
4) If grace period expired, close all remaining connections

Fixes: #96147
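The four steps above can be modeled roughly as follows. This is only a toy sketch in Python, not Elasticsearch's actual implementation: `HttpTransportModel` and its methods are hypothetical names, and connections are reduced to plain ids.

```python
import threading


class HttpTransportModel:
    """Toy model of the four shutdown steps; not Elasticsearch's real code."""

    def __init__(self):
        self._cond = threading.Condition()
        self.listening = True        # step 1 turns this off
        self.shutting_down = False   # step 2 consults this
        self.open_connections = set()

    def accept(self, conn_id):
        """Open a new connection; refused once the server stops listening."""
        with self._cond:
            if not self.listening:
                return False
            self.open_connections.add(conn_id)
            return True

    def respond(self, conn_id):
        """Finish a request on a connection and return the response headers."""
        with self._cond:
            headers = {}
            if self.shutting_down:
                # Step 2: tell the client not to reuse this connection,
                # then close the channel once the response is written.
                headers["Connection"] = "close"
                self.open_connections.discard(conn_id)
                self._cond.notify_all()
            return headers

    def shutdown(self, grace_period_s):
        """Run steps 1-4 and return the connections that were force-closed."""
        with self._cond:
            self.listening = False       # step 1
            self.shutting_down = True    # step 2
            # Step 3: wait up to the grace period for connections to close.
            self._cond.wait_for(lambda: not self.open_connections,
                                timeout=grace_period_s)
            # Step 4: whatever is still open gets closed forcibly.
            leftover = set(self.open_connections)
            self.open_connections.clear()
            return leftover


if __name__ == "__main__":
    server = HttpTransportModel()
    server.accept("a")
    server.accept("b")
    # "a" finishes its request shortly after shutdown starts; "b" never does.
    threading.Timer(0.05, server.respond, args=("a",)).start()
    print("force-closed:", server.shutdown(grace_period_s=0.2))
```

In this model a grace period of zero skips the wait entirely, so every connection that is still open is closed immediately.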

blacktek · 2023-06-07T07:04:50Z

Hello,
Thank you for this quick add-on!

I have only two small questions:

When will it be released? I see it's merged to the main branch. I'm currently running 7.17.10; will it be released with the next minor version I get from apt? (Of course, that depends on Ubuntu too.)
What is/will be the default value for http.shutdown_grace_period?

Thank you again

stu-elastic · 2023-06-07T15:09:23Z

@blacktek
It will be released with 8.9, which is the next minor in the 8 series. We don't share dates, but you can check out our minor release history to get an expectation of our cadence.

We have no plans to backport it to 7.17.

The default value for http.shutdown_grace_period is zero, which means there is no grace period. It takes a TimeValue. There will be docs for it in 8.9.
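As a rough illustration of what "takes a TimeValue" means in practice, here is a sketch that maps a duration string to milliseconds. It assumes the common Elasticsearch time-unit suffixes `ms`, `s`, `m`, `h` and `d`; `parse_time_value_ms` and `grace_period_enabled` are hypothetical helpers, not part of Elasticsearch, and the real parser accepts more forms than this.

```python
import re

# Milliseconds per unit suffix (assumed subset of Elasticsearch time units).
_UNIT_TO_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
_PATTERN = re.compile(r"^(\d+)(ms|s|m|h|d)$")


def parse_time_value_ms(text):
    """Convert strings such as '5s' or '500ms' to whole milliseconds."""
    text = text.strip()
    if text == "0":
        # A bare zero needs no unit.
        return 0
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"cannot parse time value: {text!r}")
    return int(match.group(1)) * _UNIT_TO_MS[match.group(2)]


def grace_period_enabled(setting_value="0"):
    """Zero (the default mentioned above) means there is no grace period."""
    return parse_time_value_ms(setting_value) > 0


if __name__ == "__main__":
    for value in ("0", "5s", "500ms", "1m"):
        print(value, parse_time_value_ms(value), grace_period_enabled(value))
```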

blacktek · 2023-08-01T13:50:47Z

> @blacktek It will be released with 8.9, which is the next minor in the 8 series. We don't share dates, but you can check out our minor release history to get an expectation of our cadence.

Hello,
Can you confirm whether this feature is included in the Elasticsearch 8.9 release that just came out? It seems to be the case, but I want to double-check :)

blacktek · 2023-08-07T07:44:37Z

Verified, and it's merged.

Now https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-8.9.0.html is complete.

Infra/Node Lifecycle
Gracefully shutdown elasticsearch #96363

thank you!

stu-elastic · 2023-08-09T16:53:50Z

Yeah, it's there. Thanks for the verification.

blacktek · 2023-08-30T07:30:07Z

Hello,
I've upgraded my elasticsearch cluster to 8.9.1 and yesterday we had a strange issue during a reboot:

root@ip-172-23-0-61:~# journalctl -u elasticsearch -f
Aug 29 05:16:10 ip-172-23-0-61 systemd[1]: elasticsearch.service: Deactivated successfully.
Aug 29 05:16:10 ip-172-23-0-61 systemd[1]: Stopped Elasticsearch.
Aug 29 05:16:10 ip-172-23-0-61 systemd[1]: elasticsearch.service: Consumed 6h 5min 41.223s CPU time.
-- Boot 4d5b530e70c24846b9180dca971172cf --
Aug 29 05:19:39 ip-172-23-0-61 systemd[1]: Starting Elasticsearch...
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: elasticsearch.service: Deactivated successfully.
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: elasticsearch.service: Unit process 1565 (java) remains running after unit stopped.
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: Stopped Elasticsearch.
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: elasticsearch.service: Consumed 8.021s CPU time.
-- Boot b8b341da7ee645a5a7f3ca62f4be569f --
Aug 29 05:20:17 ip-172-23-0-61 systemd[1]: Starting Elasticsearch...
Aug 29 05:20:38 ip-172-23-0-61 systemd[1]: Started Elasticsearch.

Basically the systemctl restart elasticsearch returned an exit code != 0 and the issue was with:
Unit process 1565 (java) remains running after unit stopped.

[2023-08-29T05:16:08,439][WARN ][o.e.h.AbstractHttpServerTransport] [ip-172-23-0-61] timed out while waiting [5000]ms for clients to close connections
[2023-08-29T05:16:08,450][INFO ][o.e.n.Node ] [ip-172-23-0-61] stopping ...
[2023-08-29T05:16:08,452][INFO ][o.e.x.w.WatcherService ] [ip-172-23-0-61] stopping watch service, reason [shutdown initiated]
[2023-08-29T05:16:08,453][INFO ][o.e.x.w.WatcherLifeCycleService] [ip-172-23-0-61] watcher has stopped and shutdown
[2023-08-29T05:16:08,491][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ip-172-23-0-61] [controller/3218034] [Main.cc@176] ML controller exiting
[2023-08-29T05:16:08,498][INFO ][o.e.x.m.p.NativeController] [ip-172-23-0-61] Native controller process has stopped - no new native processes can be started
[2023-08-29T05:16:08,579][INFO ][o.e.c.c.Coordinator ] [ip-172-23-0-61] master node [{ip-172-23-1-62}{Ie7dkFSjSMGLIQF27pKjlA}{Ayn4iNHmR6u1sYYy-z4vzA}{ip-172-23-1-62}{172.23.1.62}{172.23.1.62:9300}{cdfhilmrstw}{8.9.1}] disconnected, restarting discovery
[2023-08-29T05:16:08,921][INFO ][o.e.n.Node ] [ip-172-23-0-61] stopped
[2023-08-29T05:16:08,921][INFO ][o.e.n.Node ] [ip-172-23-0-61] closing ...
[2023-08-29T05:16:08,957][INFO ][o.e.n.Node ] [ip-172-23-0-61] closed

Do you have any idea what might have happened? It's the first time we have seen a restart error.

I have a side question too: do you treat keep-alive connections to Elasticsearch as "active connections" and wait for them to terminate, or do you only look at connections with active queries?

We have an nginx proxy forwarding requests to Elasticsearch, with a keep-alive timeout of 60 seconds (now reduced to 15 seconds, with a grace period of 16 seconds, to see if it happens again).

Thank you!

stu-elastic · 2023-08-30T14:56:06Z

Hey @blacktek,
That tells me there was a 5 second grace period but requests kept coming into the node. Further debugging is more appropriate for the forums.

Graceful shutdown only waits for active requests. Idle connections are shut down.

The expected procedure is:

External: Remove from proxy (no new requests)
ES: Kill all idle connections
ES: Wait for current requests to finish.
Continue shutting down
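From the operator's side, the ordering above could be scripted along these lines. This is a sketch under assumptions: `restart_behind_proxy` is a hypothetical helper, the two proxy callbacks stand in for whatever mechanism the proxy offers, and putting the node back into rotation after the restart is an added step that is not part of the procedure listed above.

```python
import subprocess


def restart_behind_proxy(remove_from_proxy, add_to_proxy,
                         run=subprocess.run, service="elasticsearch"):
    """Take the node out of rotation, restart it, then put it back.

    remove_from_proxy / add_to_proxy are caller-supplied callables
    (hypothetical hooks; how they talk to the proxy is up to you).
    """
    # External step: make sure no new requests are routed to this node.
    remove_from_proxy()
    # Elasticsearch then closes idle connections and waits up to
    # http.shutdown_grace_period for the requests still in flight.
    # check=True raises CalledProcessError on a non-zero exit code.
    run(["systemctl", "restart", service], check=True)
    # Only reached when the restart succeeded.
    add_to_proxy()


if __name__ == "__main__":
    # Dry run: print the steps instead of touching a real proxy or systemd.
    restart_behind_proxy(
        remove_from_proxy=lambda: print("removing node from proxy"),
        add_to_proxy=lambda: print("adding node back to proxy"),
        run=lambda cmd, check: print("would run:", " ".join(cmd)),
    )
```

If the restart command fails, the node stays out of rotation, which keeps traffic away from a node in an unknown state.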

blacktek · 2023-08-30T15:06:17Z

Hi @stu-elastic ,

> That tells me there was a 5 second grace period but requests kept coming into the node

This is surely possible: during the grace period (only known to elasticsearch, not to the proxy) the new requests should be rejected, according to the expected behaviour.

Am I wrong? This issue happened only once, so far.

Should I open a ticket on https://discuss.elastic.co/ ? What else can I add?

tnx

stu-elastic · 2023-08-30T19:11:55Z

I wasn't accurate. There were outstanding requests that took more than 5 seconds to complete. ES stops accepting new connections as soon as it sees the SIGTERM and closes all idle connections as well.

> Unit process 1565 (java) remains running after unit stopped.

This tells me the elasticsearch unit is not allowing the process to fully shut down and so is force-killing it. Please check the definition of the unit and make sure it matches the timeout settings being used.

blacktek · 2023-08-30T19:45:34Z

@stu-elastic
thank you again for your quick feedback.

Please look:
root@ip-172-23-0-61:~# systemctl cat elasticsearch|grep -i timeout
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=900
root@ip-172-23-0-61:~#

With TimeoutStopSec=0 we should basically not have any stop timeout.

Tomorrow we will change this setting too:
RestartSec=5s
to
RestartSec=11s

But I think it will have no effect, because RestartSec is only the time to wait before restarting a service.

It seems that our configuration of the systemd unit is correct.
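The check described here (does the unit's stop timeout leave room for the grace period?) can be sketched as below. The helper names are hypothetical, only unit-less second values and `infinity` are handled, and the 90-second fallback assumes systemd's stock default stop timeout.

```python
def parse_unit_settings(unit_text):
    """Collect Key=Value lines from `systemctl cat` output into a dict."""
    settings = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as [Service].
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def stop_timeout_covers(grace_period_s, settings, default_stop_s=90.0):
    """True if systemd waits longer than the HTTP grace period on stop."""
    raw = settings.get("TimeoutStopSec")
    if raw is None:
        limit = default_stop_s   # assumed systemd default
    elif raw in ("0", "infinity"):
        return True              # timeout logic disabled
    else:
        limit = float(raw)       # unit-less seconds only in this sketch
    return limit > grace_period_s


if __name__ == "__main__":
    unit = """
    # Disable timeout logic and wait until process is stopped
    TimeoutStopSec=0
    TimeoutStartSec=900
    """
    print(stop_timeout_covers(16, parse_unit_settings(unit)))
```

With the values shown above (`TimeoutStopSec=0` and a 16 second grace period) the check passes, which matches the conclusion that the stop timeout is not what cut the shutdown short.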

Do you have any other idea?

Thank you

stu-elastic · 2023-08-30T20:08:36Z

@blacktek
Please follow up with a discuss post in the support forum.

elasticsearchmachine added the needs:triage label May 16, 2023

pgomulka added :Core/Infra/Node Lifecycle >enhancement and removed needs:triage labels May 16, 2023

elasticsearchmachine added the Team:Core/Infra label May 16, 2023

stu-elastic mentioned this issue Jun 1, 2023

Gracefully shutdown elasticsearch #96363

Merged

stu-elastic closed this as completed in #96363 Jun 6, 2023

GRomR1 mentioned this issue Jun 14, 2023

Shutdown API for shutting down nodes in a safe manner opensearch-project/OpenSearch#1304

Open
