Feature request: graceful stop/restart #96147

Closed
blacktek opened this issue May 16, 2023 · 12 comments · Fixed by #96363
Labels
:Core/Infra/Node Lifecycle (Node startup, bootstrapping, and shutdown), >enhancement, Team:Core/Infra (Meta label for core/infra team)

Comments

@blacktek

I'd like Elasticsearch server to support a graceful stop or restart.

After a systemctl stop/restart command, requests that are already in flight should be allowed to complete first (with a customizable timeout), while no new requests are accepted.

This would simplify draining in-flight requests, without forcing the application to retry connections that were abruptly closed.

You already have something similar with https://www.elastic.co/guide/en/cloud/current/ec-maintenance-mode-deployments.html - but it should work with base elasticsearch server too.

@elasticsearchmachine added the needs:triage label May 16, 2023
@pgomulka added the :Core/Infra/Node Lifecycle and >enhancement labels and removed the needs:triage label May 16, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine added the Team:Core/Infra label May 16, 2023
stu-elastic added a commit that referenced this issue Jun 6, 2023
Early in shutdown, stop listening for HTTP requests and gracefully close all HTTP connections.

Adds `http.shutdown_grace_period` setting, the maximum amount of time to wait for in-flight HTTP requests to finish.  After that time, the http channels are all closed.

Graceful shutdown procedure:
1) Stop listening for new HTTP connections
2) Tell all new requests to add `Connection: close` response header and close the channel after the request.
3) Wait up to the grace period for all open connections to close
4) If grace period expired, close all remaining connections

Fixes: #96147
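
The four steps in this commit message can be sketched with a toy asyncio server. This is only an illustration of the procedure, not Elasticsearch's implementation: the class `GracefulHttpServer`, its methods, and the 0.1 s simulated request time are all hypothetical names and values chosen for the example. Unlike Elasticsearch, the toy does not treat idle keep-alive connections specially.

```python
import asyncio


class GracefulHttpServer:
    """Toy HTTP/1.1 server that follows the four shutdown steps listed above."""

    def __init__(self, grace_period):
        self.grace_period = grace_period  # seconds; 0 means no grace period
        self.shutting_down = False
        self.connections = set()          # one asyncio.Task per open connection
        self.server = None

    async def start(self, host="127.0.0.1", port=0):
        self.server = await asyncio.start_server(self._handle, host, port)
        return self.server.sockets[0].getsockname()[1]  # actual bound port

    async def _handle(self, reader, writer):
        task = asyncio.current_task()
        self.connections.add(task)
        try:
            while True:
                if not await reader.readline():   # client closed the connection
                    break
                while (await reader.readline()) not in (b"\r\n", b"\n", b""):
                    pass                          # skip the request headers
                await asyncio.sleep(0.1)          # simulated request work
                close = self.shutting_down        # step 2: mark the response
                head = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
                if close:
                    head += "Connection: close\r\n"
                writer.write((head + "\r\nok").encode())
                await writer.drain()
                if close:
                    break                         # close the channel after the request
        finally:
            self.connections.discard(task)
            writer.close()

    async def shutdown(self):
        """Run the four steps; return how many connections were force-closed."""
        self.shutting_down = True
        self.server.close()                       # step 1: stop listening
        open_now = set(self.connections)
        if not open_now:
            return 0
        # step 3: wait up to the grace period for open connections to finish
        _, pending = await asyncio.wait(open_now, timeout=self.grace_period)
        for task in pending:                      # step 4: grace period expired
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)
```

With a grace period longer than the in-flight request, `shutdown()` returns 0 and the client sees `Connection: close` on its response. With a grace period of 0, the in-flight request is cancelled and counted as force-closed.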
@blacktek
Author

blacktek commented Jun 7, 2023

Hello,
Thank you for this quick add-on!

I've only two small questions:

  1. when will it be released? I see it's merged to main branch. Now I'm running 7.17.10 - will it be released with the next minor version I'll get from apt? (of course, it depends on Ubuntu too)
  2. what is/will be the default value for http.shutdown_grace_period ?

Thank you again

@stu-elastic
Contributor

@blacktek
It will be released with 8.9, which is the next minor in the 8 series. We don't share dates, but you can check out our minor release history to get an expectation of our cadence.

We have no plans to backport it to 7.17.

The default value for http.shutdown_grace_period is zero, which means there is no grace period. It takes a TimeValue. There will be docs for it in 8.9.
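
To make "it takes a TimeValue" concrete, here is a minimal sketch of converting such a string into nanoseconds. It is not Elasticsearch's parser. The function name `parse_time_value_nanos` is hypothetical, and the unit suffixes (`d`, `h`, `m`, `s`, `ms`, `micros`, `nanos`) are the ones listed in the Elasticsearch time-units documentation as far as I recall; please verify against the docs for your version. Only whole numbers are handled.

```python
import re

# Nanoseconds per unit; the suffix list is an assumption based on the
# Elasticsearch time-units documentation.
_UNIT_NANOS = {
    "nanos": 1,
    "micros": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3_600 * 1_000_000_000,
    "d": 86_400 * 1_000_000_000,
}


def parse_time_value_nanos(text):
    """Convert a string such as '5s' or '500ms' to integer nanoseconds."""
    text = text.strip()
    if text == "0":                 # zero needs no unit; it means "no grace period"
        return 0
    match = re.fullmatch(r"(\d+)(nanos|micros|ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"not a whole-number time value: {text!r}")
    return int(match.group(1)) * _UNIT_NANOS[match.group(2)]
```

For example, `parse_time_value_nanos("16s") // 1_000_000` gives the grace period in milliseconds, the unit that appears in the shutdown log message.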

@blacktek
Author

blacktek commented Aug 1, 2023

@blacktek It will be released with 8.9, which is the next minor in the 8 series. We don't share dates, but you can check out our minor release history to get an expectation of our cadence.

Hello,
can you confirm whether this feature has been released in the Elasticsearch 8.9 that was just released? It seems to be the case, but I want to double-check :)

@blacktek
Author

blacktek commented Aug 7, 2023

verified and it's merged.

now https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-8.9.0.html is complete.

Infra/Node Lifecycle
Gracefully shutdown elasticsearch #96363

thank you!

@stu-elastic
Contributor

Yeah, it's there. Thanks for the verification.

@blacktek
Author

Hello,
I've upgraded my elasticsearch cluster to 8.9.1 and yesterday we had a strange issue during a reboot:

root@ip-172-23-0-61:~# journalctl -u elasticsearch -f
Aug 29 05:16:10 ip-172-23-0-61 systemd[1]: elasticsearch.service: Deactivated successfully.
Aug 29 05:16:10 ip-172-23-0-61 systemd[1]: Stopped Elasticsearch.
Aug 29 05:16:10 ip-172-23-0-61 systemd[1]: elasticsearch.service: Consumed 6h 5min 41.223s CPU time.
-- Boot 4d5b530e70c24846b9180dca971172cf --
Aug 29 05:19:39 ip-172-23-0-61 systemd[1]: Starting Elasticsearch...
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: elasticsearch.service: Deactivated successfully.
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: elasticsearch.service: Unit process 1565 (java) remains running after unit stopped.
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: Stopped Elasticsearch.
Aug 29 05:19:45 ip-172-23-0-61 systemd[1]: elasticsearch.service: Consumed 8.021s CPU time.
-- Boot b8b341da7ee645a5a7f3ca62f4be569f --
Aug 29 05:20:17 ip-172-23-0-61 systemd[1]: Starting Elasticsearch...
Aug 29 05:20:38 ip-172-23-0-61 systemd[1]: Started Elasticsearch.

Basically the systemctl restart elasticsearch returned an exit code != 0 and the issue was with:
Unit process 1565 (java) remains running after unit stopped.

[2023-08-29T05:16:08,439][WARN ][o.e.h.AbstractHttpServerTransport] [ip-172-23-0-61] timed out while waiting [5000]ms for clients to close connections

[2023-08-29T05:16:08,450][INFO ][o.e.n.Node ] [ip-172-23-0-61] stopping ...

[2023-08-29T05:16:08,452][INFO ][o.e.x.w.WatcherService ] [ip-172-23-0-61] stopping watch service, reason [shutdown initiated]

[2023-08-29T05:16:08,453][INFO ][o.e.x.w.WatcherLifeCycleService] [ip-172-23-0-61] watcher has stopped and shutdown

[2023-08-29T05:16:08,491][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ip-172-23-0-61] [controller/3218034] [Main.cc@176] ML controller exiting

[2023-08-29T05:16:08,498][INFO ][o.e.x.m.p.NativeController] [ip-172-23-0-61] Native controller process has stopped - no new native processes can be started

[2023-08-29T05:16:08,579][INFO ][o.e.c.c.Coordinator ] [ip-172-23-0-61] master node [{ip-172-23-1-62}{Ie7dkFSjSMGLIQF27pKjlA}{Ayn4iNHmR6u1sYYy-z4vzA}{ip-172-23-1-62}{172.23.1.62}{172.23.1.62:9300}{cdfhilmrstw}{8.9.1}] disconnected, restarting discovery

[2023-08-29T05:16:08,921][INFO ][o.e.n.Node ] [ip-172-23-0-61] stopped

[2023-08-29T05:16:08,921][INFO ][o.e.n.Node ] [ip-172-23-0-61] closing ...

[2023-08-29T05:16:08,957][INFO ][o.e.n.Node ] [ip-172-23-0-61] closed

Do you have any idea what might have happened? It's the first time we have seen a restart error.

I've a side question too: do you consider the keepalive connections to elasticsearch as "active connections" waiting for their termination? or do you only look at connections with active queries?

We have an nginx proxy forwarding the requests to elasticsearch, with a keep-alive timeout of 60 seconds (now reduced to 15 seconds, with a grace period of 16 seconds, to see if it happens again).

Thank you!

@stu-elastic
Contributor

Hey @blacktek,
That tells me there was a 5 second grace period but requests kept coming into the node. Further debugging is more appropriate for the forums.

Graceful shutdown only waits for active requests. Idle connections are shut down.

The expected procedure is:

  1. External: Remove from proxy (no new requests)
  2. ES: Kill all idle connections
  3. ES: Wait for current requests to finish.
  4. Continue shutting down
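
As a rough model of steps 2 and 3 (not Elasticsearch code), the sketch below classifies each open connection at shutdown time. The `Connection` class, `simulate_shutdown`, and the connection names are hypothetical. An idle keep-alive connection is represented by `remaining_ms=None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Connection:
    name: str
    remaining_ms: Optional[int]  # None = idle keep-alive, no request in flight


def simulate_shutdown(connections, grace_ms):
    """Return (outcome per connection, whether the grace period timed out)."""
    outcome = {}
    timed_out = False
    for conn in connections:
        if conn.remaining_ms is None:
            # Step 2: idle connections are not waited for.
            outcome[conn.name] = "closed immediately (idle)"
        elif conn.remaining_ms <= grace_ms:
            # Step 3: the active request finishes within the grace period.
            outcome[conn.name] = "completed, then closed"
        else:
            # The request outlives the grace period and is cut off.
            outcome[conn.name] = "force-closed when grace period expired"
            timed_out = True
    return outcome, timed_out


# Hypothetical mix: one idle proxy connection, one quick and one slow request,
# with the 5000 ms grace period seen in the log above.
conns = [
    Connection("nginx-keepalive", None),
    Connection("quick-search", 300),
    Connection("slow-search", 7000),
]
outcome, timed_out = simulate_shutdown(conns, grace_ms=5000)
```

In this model only `slow-search` causes a timeout, which corresponds to the "timed out while waiting [5000]ms" warning; the idle keep-alive connection never holds up the shutdown.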

@blacktek
Author

Hi @stu-elastic ,

That tells me there was a 5 second grace period but requests kept coming into the node

This is surely possible: during the grace period (only known to elasticsearch, not to the proxy) the new requests should be rejected, according to the expected behaviour.

Am I wrong? This issue happened only once, so far.

Should I open a ticket on https://discuss.elastic.co/ ? What else can I add?

tnx

@stu-elastic
Contributor

I wasn't accurate. There were outstanding requests that took more than 5 seconds to complete. ES stops accepting new connections as soon as it sees the SIGTERM and closes all idle connections as well.

Unit process 1565 (java) remains running after unit stopped.

This tells me the elasticsearch unit is not allowing the process to fully shut down and so is force killing it. Please check the definition of the unit and make sure it matches the timeout settings being used.

@blacktek
Author

@stu-elastic
thank you again for your quick feedback.

Please look:
root@ip-172-23-0-61:~# systemctl cat elasticsearch|grep -i timeout
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=900
root@ip-172-23-0-61:~#

With TimeoutStopSec=0 there should be essentially no stop timeout.
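
A small sketch of that check: read `TimeoutStopSec` from the unit text and compare it with the configured grace period. The helper names `stop_timeout_seconds` and `grace_period_fits` are hypothetical. The sketch assumes, as the unit file's own comment says, that `0` disables the stop timeout, and it only understands plain second values (systemd accepts richer time spans that are not handled here).

```python
def stop_timeout_seconds(unit_text):
    """Return TimeoutStopSec in seconds, or None if the stop timeout is disabled."""
    value = None
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue                      # skip comments and section headers
        key, _, raw = line.partition("=")
        if key.strip() == "TimeoutStopSec":
            value = raw.strip()           # keep the last assignment seen
    if value is None:
        raise ValueError("TimeoutStopSec is not set in this unit text")
    if value in ("0", "infinity"):        # assumption: both disable the timeout
        return None
    if value.endswith("s"):
        value = value[:-1]                # accept "10s" as well as "10"
    return float(value)


def grace_period_fits(unit_text, grace_seconds):
    """True if systemd will wait at least as long as the HTTP grace period."""
    timeout = stop_timeout_seconds(unit_text)
    return timeout is None or timeout > grace_seconds


# Unit settings as shown in the grep output above.
unit_text = """
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=900
"""
fits = grace_period_fits(unit_text, grace_seconds=16)
```

Under these assumptions `fits` is true for the settings in this thread, which is consistent with the unit definition not being the cause of the forced stop.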

Tomorrow we change this setting too:
RestartSec=5s
to
RestartSec=11s

But I think it will have no effect because that timeout is the timeout to wait before restarting a service.

It seems that our configuration of the systemd unit is correct.

Do you have any other idea?

Thank you

@stu-elastic
Contributor

@blacktek
Please follow up with a discuss post in the support forum.
