Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a gracefulShutdownTimeout option for 0-downtime deployment on Kubernetes #2421

Closed
1 task done
Lp-Francois opened this issue Nov 22, 2023 · 8 comments
Closed
1 task done

Comments

@Lp-Francois
Copy link
Contributor

Is there an existing issue that is already proposing this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe it

When load testing on a liveness endpoint of a Nestjs app using terminus while doing a rolling update on Kubernetes, I noticed a small percentage of failed requests. This percentage of fail errors is coming from a non 0-downtime deployment.

Here are some info about the app:

  • deployed on Kubernetes with liveness/readiness probes
  • terminus exposes 2 endpoints: liveness and readiness
  • NestJS has app.enableShutdownHooks();
  • it starts the app in a docker container using dumb-init

The expected graceful shutdown behaviour I would expect from a production-ready NestJs app would be:

  1. on receiving SIGTERM signal, set readiness probe to fail with 503, to tell the orchestrator to stop sending requests
  2. Wait X seconds to be sure traffic stops being forwarded to the app by Kubernetes (should match the interval of the readiness probe + few seconds, to be sure the orchestrator is aware the pod should stop receive traffic),
    a. ⚠️ currently, it doesn't wait this time. It sets the readiness to fail an then proceed to close the web server right away.
  3. proceed to close the webserver (process last requests if there are still some long ones running)
  4. proceed to close database connections and others connections
  5. Shutdown the app

Resources:

Describe the solution you'd like

An option to allow this package to wait X seconds after setting the probes to fail before starting shutting down the web server.

Users could pass an argument to Terminus as an option, or even better with as an environment variable GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS. When creating the Kubernetes manifest, you could make sure to always pass to the pod the GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS variable containing the same or higher delay than the readiness interval.

I write about Kubernetes, but it would be the case for all kind of orchestrators I believe.

An example of NestJS provider implementing this feature:

import { BeforeApplicationShutdown, Injectable } from '@nestjs/common';
import { LoggerService } from '@org/shared/custom-logger';

// eslint-disable-next-line no-promise-executor-return
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const gracefulShutdownTimeoutInSeconds =
  Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS']) || 10;

@Injectable()
export class GracefulShutdownService implements BeforeApplicationShutdown {
  constructor(private readonly logger: LoggerService) {}

  async beforeApplicationShutdown(signal: string) {
    this.logger.info(`Received termination signal ${signal}`);

    if (signal === 'SIGTERM') {
      this.logger.info(
        `Await ${gracefulShutdownTimeoutInSeconds} seconds before shutdown`
      );
      await sleep(gracefulShutdownTimeoutInSeconds * 1000);
      this.logger.info(`Timeout reached, shutdown now`);
    }
  }
}

Teachability, documentation, adoption, migration strategy

Users can smoothly adopt this new feature by just modifying their environment variables.

Specifically, they should set the GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS variable to their desired timeout duration (in seconds) for graceful shutdown when a SIGTERM signal is received. This adjustment ensures that the feature aligns with their target Kubernetes readiness interval.

What is the motivation / use case for changing the behavior?

The change aims to improve the resilience and stability of NestJS apps during rolling updates in Kubernetes, minimizing service interruptions. By introducing a 'graceful shutdown' timeout period, the apps can appropriately manage termination signals and orderly close connections. This ensures a seamless user experience and reduces failures due to non-zero-downtime deployment.

@Lp-Francois
Copy link
Contributor Author

Note, I could do a PR to add this feature, once I have the go from maintainers :)

@BrunnerLivio
Copy link
Member

@Lp-Francois Thanks for the comprehensive write-up, I really appreciate it.

I'd welcome a PR for this. However, I am against configuring this via env variable. Let the user of this lib decide how this configuration option should be set (e.g. via env var, hardcoded, yaml file, whatever). Let's just add a configuration option like this:

TerminusModule.forRoot({
  gracefulShutdownTimeoutMs: Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS'])  * 1000
})

Also, I'd prefer using milliseconds instead of seconds as the unit for this option, so it's more versatile. Per default, I'd set gracefulShutdownTimeoutMs = 0 so that no existing consumer might break for some reason. I'll consider changing the default to a more appropriate value once we release the next major version.

@Lp-Francois
Copy link
Contributor Author

@Lp-Francois Thanks for the comprehensive write-up, I really appreciate it.

I'd welcome a PR for this. However, I am against configuring this via env variable. Let the user of this lib decide how this configuration option should be set (e.g. via env var, hardcoded, yaml file, whatever). Let's just add a configuration option like this:

TerminusModule.forRoot({
  gracefulShutdownTimeoutMs: Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS'])  * 1000
})

Also, I'd prefer using milliseconds instead of seconds as the unit for this option, so it's more versatile. Per default, I'd set gracefulShutdownTimeoutMs = 0 so that no existing consumer might break for some reason. I'll consider changing the default to a more appropriate value once we release the next major version.

Agree with both: milliseconds, forRoot, and default to 0, it makes sense to avoid breaking any existing setup :)

Will prepare something

Lp-Francois added a commit to Lp-Francois/terminus that referenced this issue Nov 22, 2023
@Lp-Francois
Copy link
Contributor Author

@BrunnerLivio PR ready ✅

@Lp-Francois
Copy link
Contributor Author

#2422
PR merged, I close the issue

@BrunnerLivio
Copy link
Member

Released with v10.2.0 🎉

@clintonb
Copy link

clintonb commented Dec 5, 2024

The merged feature implements waiting, but it doesn't implement this:

on receiving SIGTERM signal, set readiness probe to fail with 503, to tell the orchestrator to stop sending requests

How are folks doing this? I'm noticing some requests are still making it to my pods after they receive SIGTERM, and suspect its because the passing readiness probe is resulting in the pods not being quickly removed from the NEG/load balancer.

@clintonb
Copy link

clintonb commented Dec 6, 2024

Never mind. I see the library does return a 503 after receiving SIGTERM:

status = this.isShuttingDown ? 'shutting_down' : status;
.

My probe is calling the wrong endpoint. 🤦🏾

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants