Adding a gracefulShutdownTimeout option for 0-downtime deployment on Kubernetes #2421

Lp-Francois · 2023-11-22T12:30:43Z

Is there an existing issue that is already proposing this?

I have searched the existing issues

Is your feature request related to a problem? Please describe it

While load testing the liveness endpoint of a NestJS app that uses terminus during a rolling update on Kubernetes, I noticed a small percentage of failed requests. These failures show that the deployment is not zero-downtime.

Here is some information about the app:

deployed on Kubernetes with liveness/readiness probes
terminus exposes 2 endpoints: liveness and readiness
NestJS has app.enableShutdownHooks();
it starts the app in a docker container using dumb-init

The graceful shutdown behaviour I would expect from a production-ready NestJS app is:

1. On receiving the SIGTERM signal, set the readiness probe to fail with 503 to tell the orchestrator to stop sending requests.
2. Wait X seconds to be sure Kubernetes has stopped forwarding traffic to the app (X should match the interval of the readiness probe plus a few seconds, so the orchestrator is aware the pod should stop receiving traffic).
   a. ⚠️ Currently, it doesn't wait. It sets the readiness probe to fail and then proceeds to close the web server right away.
3. Close the web server (finish processing any long-running requests that are still in flight).
4. Close database connections and other connections.
5. Shut down the app.
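
To make the ordering above concrete, here is a minimal sketch of the sequence. It is not terminus code: the ShutdownDeps interface and the shutdownInOrder function are hypothetical names introduced only for illustration, and the three callbacks stand in for whatever the app actually uses to flip readiness, close the HTTP server, and close its connections.

```typescript
// Hypothetical dependencies (assumptions, not part of terminus or NestJS),
// injected so the ordering of the steps is easy to see.
interface ShutdownDeps {
  markNotReady: () => void;               // step 1: readiness starts returning 503
  closeServer: () => Promise<void>;       // step 3: stop accepting, drain in-flight requests
  closeConnections: () => Promise<void>;  // step 4: database and other clients
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs steps 1 to 4 in order. Step 5 (exiting the process) is left to the caller.
async function shutdownInOrder(deps: ShutdownDeps, waitMs: number): Promise<void> {
  deps.markNotReady();           // step 1
  await sleep(waitMs);           // step 2: give the orchestrator time to notice
  await deps.closeServer();      // step 3
  await deps.closeConnections(); // step 4
}
```

The point of the sketch is only that the wait sits between failing readiness and closing the server, which is the step reported as missing in 2.a.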

Resources:

https://learnk8s.io/graceful-shutdown

Describe the solution you'd like

An option that makes this package wait X seconds after setting the probes to fail before it starts shutting down the web server.

Users could pass the value to Terminus as an option or, even better, as an environment variable GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS. When creating the Kubernetes manifest, you could make sure the pod always receives a GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS value equal to or higher than the readiness probe interval.

I describe this in terms of Kubernetes, but I believe it applies to any kind of orchestrator.

An example of NestJS provider implementing this feature:

import { BeforeApplicationShutdown, Injectable } from '@nestjs/common';
import { LoggerService } from '@org/shared/custom-logger';

// eslint-disable-next-line no-promise-executor-return
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const gracefulShutdownTimeoutInSeconds =
  Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS']) || 10;

@Injectable()
export class GracefulShutdownService implements BeforeApplicationShutdown {
  constructor(private readonly logger: LoggerService) {}

  async beforeApplicationShutdown(signal: string) {
    this.logger.info(`Received termination signal ${signal}`);

    if (signal === 'SIGTERM') {
      this.logger.info(
        `Await ${gracefulShutdownTimeoutInSeconds} seconds before shutdown`
      );
      await sleep(gracefulShutdownTimeoutInSeconds * 1000);
      this.logger.info(`Timeout reached, shutdown now`);
    }
  }
}

Teachability, documentation, adoption, migration strategy

Users can smoothly adopt this new feature by just modifying their environment variables.

Specifically, they should set the GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS variable to their desired timeout duration (in seconds) for graceful shutdown when a SIGTERM signal is received. This adjustment ensures that the feature aligns with their target Kubernetes readiness interval.
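
As a rough illustration of how the timeout could be derived from the readiness probe settings, here is a small sketch. The helper name shutdownDelayMs and the 2-second default buffer are my own assumptions. It also multiplies by failureThreshold, which goes slightly beyond the "interval plus a few seconds" rule described above, since Kubernetes only marks a pod unready after that many consecutive failed probes.

```typescript
// Subset of a Kubernetes readiness probe spec relevant to the delay.
interface ReadinessProbe {
  periodSeconds: number;    // how often the probe runs
  failureThreshold: number; // consecutive failures before the pod is marked unready
}

// Hypothetical helper: worst-case time for the pod to be marked unready,
// plus a safety buffer, in milliseconds.
function shutdownDelayMs(probe: ReadinessProbe, bufferSeconds = 2): number {
  const secondsUntilUnready = probe.periodSeconds * probe.failureThreshold;
  return (secondsUntilUnready + bufferSeconds) * 1000;
}
```

Whatever value is chosen should stay below the pod's terminationGracePeriodSeconds, otherwise the container is killed before the wait and the remaining shutdown steps can finish.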

What is the motivation / use case for changing the behavior?

The change aims to improve the resilience and stability of NestJS apps during rolling updates in Kubernetes, minimizing service interruptions. By introducing a 'graceful shutdown' timeout period, the apps can appropriately manage termination signals and orderly close connections. This ensures a seamless user experience and reduces failures due to non-zero-downtime deployment.


Lp-Francois · 2023-11-22T12:31:19Z

Note: I could open a PR to add this feature once I have the go-ahead from the maintainers :)

BrunnerLivio · 2023-11-22T20:55:59Z

@Lp-Francois Thanks for the comprehensive write-up, I really appreciate it.

I'd welcome a PR for this. However, I am against configuring this via env variable. Let the user of this lib decide how this configuration option should be set (e.g. via env var, hardcoded, yaml file, whatever). Let's just add a configuration option like this:

TerminusModule.forRoot({
  gracefulShutdownTimeoutMs: Number(process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS'])  * 1000
})

Also, I'd prefer using milliseconds instead of seconds as the unit for this option, so it's more versatile. By default, I'd set gracefulShutdownTimeoutMs = 0 so that no existing consumer breaks. I'll consider changing the default to a more appropriate value once we release the next major version.
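
One caveat with the snippet above: if the environment variable is unset, Number(undefined) is NaN, so the option would receive NaN rather than 0. A minimal sketch of a defensive conversion follows; timeoutMsFromSeconds is a hypothetical helper name, not part of terminus.

```typescript
// Hypothetical helper: converts a seconds value read from the environment
// into milliseconds, falling back when the value is missing or invalid.
function timeoutMsFromSeconds(raw: string | undefined, fallbackMs = 0): number {
  // Number('') is 0 and Number(undefined) is NaN, so handle both explicitly.
  if (raw === undefined || raw.trim() === '') {
    return fallbackMs;
  }
  const seconds = Number(raw);
  if (!Number.isFinite(seconds) || seconds < 0) {
    return fallbackMs;
  }
  return Math.round(seconds * 1000);
}

// Possible usage with the option proposed above:
// TerminusModule.forRoot({
//   gracefulShutdownTimeoutMs: timeoutMsFromSeconds(
//     process.env['GRACEFUL_SHUTDOWN_TIMEOUT_SECONDS'],
//   ),
// })
```

Falling back to 0 keeps the "no wait" default suggested for existing consumers.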

Lp-Francois · 2023-11-22T21:33:20Z

Agreed on all three points: milliseconds, forRoot, and a default of 0. It makes sense to avoid breaking any existing setup :)

Will prepare something

Lp-Francois · 2023-11-22T22:52:45Z

@BrunnerLivio PR ready ✅

Lp-Francois · 2023-11-26T18:28:11Z

#2422
PR merged, so I'm closing the issue.

BrunnerLivio · 2023-11-27T13:23:08Z

Released with v10.2.0 🎉

clintonb · 2024-12-05T23:27:47Z

The merged feature implements waiting, but it doesn't implement this:

on receiving SIGTERM signal, set readiness probe to fail with 503, to tell the orchestrator to stop sending requests

How are folks doing this? I'm noticing some requests are still making it to my pods after they receive SIGTERM, and I suspect it's because the passing readiness probe results in the pods not being quickly removed from the NEG/load balancer.

clintonb · 2024-12-06T00:57:17Z

Never mind. I see the library does return a 503 after receiving SIGTERM:

terminus/lib/health-check/health-check-executor.service.ts, line 102 in 596df60:

status = this.isShuttingDown ? 'shutting_down' : status;

My probe is calling the wrong endpoint. 🤦🏾

Lp-Francois added the type: feature label Nov 22, 2023

Lp-Francois added a commit to Lp-Francois/terminus that referenced this issue Nov 22, 2023

feat: add graceful-shutdown timeout service with tests

dd71df6

set default option to 0ms Related issue: nestjs#2421

Lp-Francois mentioned this issue Nov 22, 2023

feat: graceful shutdown timeout #2422

Merged

12 tasks

Lp-Francois closed this as completed Nov 26, 2023

Lp-Francois mentioned this issue Jan 17, 2025

Graceful shutdown isn't working like expected #2569

Open

4 tasks
