Adding a gracefulShutdownTimeout option for 0-downtime deployment on Kubernetes #2421
Note: I could do a PR to add this feature once I have the go-ahead from the maintainers :)
@Lp-Francois Thanks for the comprehensive write-up, I really appreciate it. I'd welcome a PR for this. However, I am against configuring this via env variable. Let the user of this lib decide how this configuration option should be set (e.g. via env var, hardcoded, YAML file, whatever). Let's just add a configuration option like this:

```typescript
TerminusModule.forRoot({
  gracefulShutdownTimeoutMs: Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS']) * 1000
})
```

Also, I'd prefer using milliseconds instead of seconds as the unit for this option, so it's more versatile. Per default, I'd set it to `0`.
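The seconds-to-milliseconds conversion in the snippet above can be isolated into a small helper. This is only an illustration; the function name `envSecondsToMs` and its fallback behaviour are my own, not part of terminus:

```typescript
// Convert a seconds value from an environment variable into milliseconds.
// Returns the fallback (0 ms by default) when the variable is unset or not numeric.
function envSecondsToMs(value: string | undefined, fallbackMs = 0): number {
  const seconds = Number(value);
  return Number.isFinite(seconds) ? seconds * 1000 : fallbackMs;
}

// Usage sketch, wiring the parsed value into the proposed option:
// TerminusModule.forRoot({
//   gracefulShutdownTimeoutMs: envSecondsToMs(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS']),
// });

envSecondsToMs('5');       // → 5000
envSecondsToMs(undefined); // → 0 (keeps the non-breaking default)
```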
Agree on all points: milliseconds, forRoot, and a default of 0; it makes sense to avoid breaking any existing setup :) Will prepare something.
set default option to 0ms (related issue: nestjs#2421)
@BrunnerLivio PR ready ✅
#2422
The merged feature implements waiting, but it doesn't implement this:
How are folks doing this? I'm noticing some requests are still making it to my pods after they receive SIGTERM.
Never mind. I see the library does return a 503 after receiving SIGTERM.
My probe is calling the wrong endpoint. 🤦🏾
Is there an existing issue that is already proposing this?
Is your feature request related to a problem? Please describe it
When load testing the liveness endpoint of a NestJS app using terminus while doing a rolling update on Kubernetes, I noticed a small percentage of failed requests. These failures come from a deployment that is not 0-downtime. Here is some info about the app:
- terminus exposes 2 endpoints: liveness and readiness
- app.enableShutdownHooks(); is called
- dumb-init is used
The expected graceful shutdown behaviour from a production-ready NestJS app would be:
1. On SIGTERM, set the readiness probe to fail with 503, to tell the orchestrator to stop sending requests
2. Wait long enough for the orchestrator to stop routing traffic to the pod
3. Shut down the web server and close the open connections
Resources:
Describe the solution you'd like
An option to allow this package to wait X seconds after setting the probes to fail before starting to shut down the web server.
Users could pass an argument to Terminus as an option, or even better as an environment variable GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS. When creating the Kubernetes manifest, you could make sure to always pass the pod a GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS variable containing the same or a higher delay than the readiness interval.

I write about Kubernetes, but I believe it would be the case for all kinds of orchestrators.
An example of NestJS provider implementing this feature:
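A minimal sketch of such a provider (my own illustration, not the author's original example): the class name `GracefulShutdownService` is hypothetical, and the lifecycle interface is defined locally to keep the snippet self-contained, where a real app would import `BeforeApplicationShutdown` from '@nestjs/common':

```typescript
// Local stand-in for NestJS's BeforeApplicationShutdown lifecycle interface.
interface BeforeApplicationShutdown {
  beforeApplicationShutdown(signal?: string): Promise<void> | void;
}

// Hypothetical provider: when SIGTERM arrives, wait the configured number of
// seconds (letting the failing readiness probe drain traffic) before NestJS
// proceeds to close the HTTP server.
class GracefulShutdownService implements BeforeApplicationShutdown {
  private readonly timeoutMs =
    Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS'] ?? 0) * 1000;

  async beforeApplicationShutdown(signal?: string): Promise<void> {
    if (signal !== 'SIGTERM' || this.timeoutMs <= 0) return;
    await new Promise<void>((resolve) => setTimeout(resolve, this.timeoutMs));
  }
}
```

Registering the class as a provider in the root module would be enough, since `app.enableShutdownHooks()` makes NestJS call `beforeApplicationShutdown` on termination signals.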
Teachability, documentation, adoption, migration strategy
Users can smoothly adopt this new feature by just modifying their environment variables.
Specifically, they should set the
GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS
variable to their desired timeout duration (in seconds) for graceful shutdown when a SIGTERM signal is received. This adjustment ensures that the feature aligns with their target Kubernetes readiness interval.

What is the motivation / use case for changing the behavior?
The change aims to improve the resilience and stability of NestJS apps during rolling updates in Kubernetes, minimizing service interruptions. By introducing a 'graceful shutdown' timeout period, the apps can appropriately manage termination signals and orderly close connections. This ensures a seamless user experience and reduces failures due to non-zero-downtime deployment.