
Force stop command deletes all jobs if the kubernetes job no longer exists #3750

Closed
busma13 opened this issue Sep 12, 2024 · 6 comments
Assignees: busma13
Labels: bug, k8s (Applies to Teraslice in kubernetes cluster mode only.), pkg/teraslice

Comments


busma13 commented Sep 12, 2024

When force deleting a job, it is possible that the kubernetes job resource has already been deleted while other resources (pods, services, deployments) have been orphaned. In that case K8s._deleteObjByExId() will receive undefined as the name of the job resource. The K8s.delete() function uses name to select which jobs to delete; if passed undefined or an empty string it will return ALL job resources and then delete them.

K8s.delete() needs to ensure that name is defined and not an empty string.
K8s._deleteObjByExId() must ensure there is a resource to delete before calling delete().
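
Roughly, those two guards could look like the sketch below (TypeScript). This is illustrative only, not the actual Teraslice code; the list() helper, the label selector, and the resource shapes are assumptions made for the example.

// Illustrative sketch only -- not the Teraslice source. The method names
// (delete, _deleteObjByExId) come from this issue; everything else is assumed.
interface K8sObject {
    metadata: { name?: string };
}

interface K8sList {
    items: K8sObject[];
}

class K8sSketch {
    // Stand-in for the real namespaced list call against the Kubernetes API.
    async list(selector: string, objType: string): Promise<K8sList> {
        return { items: [] };
    }

    async delete(name: string | undefined, objType: string): Promise<void> {
        // Guard: an undefined or empty name would otherwise select every
        // resource of the given type, so refuse to continue.
        if (name === undefined || name.trim() === '') {
            throw new Error(`Name is required to delete a ${objType} resource`);
        }
        // ... issue the namespaced delete for `name` here ...
    }

    async _deleteObjByExId(exId: string, objType: string): Promise<void> {
        // The label selector is illustrative; the real labels may differ.
        const objList = await this.list(`exId=${exId}`, objType);

        // Guard: nothing matched (e.g. the job resource is already gone),
        // so return instead of passing undefined down into delete().
        if (objList.items.length === 0) {
            return;
        }

        for (const obj of objList.items) {
            await this.delete(obj.metadata.name, objType);
        }
    }
}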

This bug is only in kubernetes clustering, as kubernetesV2 receives a typed object when looking up the job resource, and therefore already checks whether name is defined.

busma13 added the bug, k8s, and pkg/teraslice labels Sep 12, 2024
busma13 self-assigned this Sep 12, 2024

godber commented Sep 14, 2024

I was just thinking about this and I am pretty sure the steps to reproduce are straightforward:

  1. start two jobs, job1 with a long delay and job2 that will be accidentally stopped
  2. call /_stop on job1
  3. immediately call /_stop?force on job1
  4. check the state of the jobs; both job1 and job2 should be gone

The first call to stop deletes all of the resources related to that exId, but leaves the pods in Terminating due to their grace period. The second call to _stop uses force, finds no Kubernetes Job for the execution controller, and ends up sending undefined to the k8s API delete calls.
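
For reference, here is that reproduction sketched as a script against the Teraslice HTTP API (TypeScript, Node 18+ fetch). The master URL, the HTTP method, and the exact form of the force query parameter are assumptions; the issue only names the /_stop and /_stop?force calls.

// Sketch of the reproduction above; host, port, HTTP method, and the force
// parameter form are assumptions, not taken from the Teraslice docs.
async function reproduce(jobId: string): Promise<void> {
    const base = 'http://localhost:5678';

    // First stop: deletes the exId's resources; pods linger in Terminating
    // for their grace period.
    await fetch(`${base}/jobs/${jobId}/_stop`, { method: 'POST' });

    // Immediate force stop: the execution controller's Kubernetes Job is
    // already gone, so the name lookup yields undefined and the delete call
    // ends up matching every job resource.
    await fetch(`${base}/jobs/${jobId}/_stop?force=true`, { method: 'POST' });
}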


busma13 commented Sep 17, 2024

While resolving this issue I ran into some other problems:

  • The force option was only deleting jobs and pods. It now also deletes services, deployments, and replicasets. This will let us force delete any of these resources if K8s has a strange failure that leaves them up.
  • Force stopping after a regular stop is not possible in K8s. K8s marks resources for deletion, adds the metadata.deletionGracePeriodSeconds and metadata.deletionTimestamp fields, and will not send a SIGKILL until the grace period has elapsed. metadata.deletionGracePeriodSeconds cannot be modified or removed, so a second delete call with gracePeriodSeconds = 0 (the equivalent of kubectl delete --force) cannot shorten the grace period and a SIGKILL will not be sent immediately. For this reason the order in which we force delete resources matters: we must delete child resources before their parents, or the children will be deleted in the background with the normal grace period (see the sketch below). It may also be possible to change the propagationPolicy from background to orphan or foreground, but that would require other changes.
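
A minimal sketch of that ordering in TypeScript; the resource list and the deleteByExId callback are placeholders for whatever performs the per-resource force delete, not the actual Teraslice API.

type DeleteByExId = (exId: string, objType: string) => Promise<void>;

// Children before parents: if a parent (Deployment, Job) is deleted first,
// its children are removed by a background cascade that keeps the normal
// grace period, and the force delete never reaches them.
const forceDeleteOrder = ['pods', 'replicasets', 'services', 'deployments', 'jobs'];

async function forceDeleteExecution(exId: string, deleteByExId: DeleteByExId): Promise<void> {
    for (const objType of forceDeleteOrder) {
        // A real implementation would also request gracePeriodSeconds = 0 here.
        await deleteByExId(exId, objType);
    }
}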


busma13 commented Sep 18, 2024

Useful links:


busma13 commented Sep 19, 2024

In certain scenarios calling force delete on a pod will not actually delete the process immediately.
Steps to reproduce:

  • Ensure teraslice.shutdown_timeout in the config is sufficiently long to see the problem. 5 minutes should work.
  • start a teraslice job similar to this:
  {
    "name": "peter-datagen-to-delay-to-kafka",
    "lifecycle": "persistent",
    "workers": 3,
    "assets": [
        "standard",
        "kafka"
    ],
    "operations": [
        {
            "_op": "data_generator",
            "size": 5000
        },
        {
            "_op": "delay",
            "ms": 90000
        },
        {
            "_op": "kafka_sender",
            "connection": "default",
            "topic": "rand-p-json-v1",
            "size": 5000
        }
    ]
  }

Delay is used here to keep the slice from finishing, preventing the shutdown process from completing until the delay ends.

  • in a separate terminal watch the logs of a pod: kubectl -n <namespace> logs <pod-name> -f | bunyan
  • call stop on the job: curl <teraslice-client>:<port>/jobs/<job-id>/_stop
    • the worker pods will receive a SIGTERM and be placed in the terminating status.
  • force delete the pod you are watching: kubectl -n <namespace> delete pod <pod-name> --force --grace-period=0
    • kubectl get pods -n <namespace> will no longer show the pod
    • the logs will still be coming in, showing that the worker process is still shutting down
    • once the shutdown timeout is reached the process will terminate.

godber pushed a commit that referenced this issue Sep 19, 2024
This PR makes the following changes:

- Ensure that `K8s._deleteObjByExId()` finds an object in
`objList.items[]` before calling `K8s.delete()`.
- Throw if `K8s.delete()` receives a `name` argument that is undefined
or an empty string.
- Refactor `K8s._deleteObjByExId()` to delete multiple objects and not
handle `force` logic.
- Refactor `k8s.deleteExecution()` to handle `force` logic by calling
`K8s._deleteObjByExId()` on multiple resources.
- bump teraslice from version 2.3.1 to 2.3.2
- pin the kind image used by k8s to:
`kindest/node:v1.28.12@sha256:fa0e48b1e83bb8688a5724aa7eebffbd6337abd7909ad089a2700bf08c30c6ea`.
Note that the version in the image tag refers to the kubernetes server
version that will be used. We should keep this in sync with prod.

ref: #3750

busma13 commented Sep 19, 2024

Here is a simpler reproduction:

  • dockerfile:
FROM alpine:3.18

COPY ./shutdown-script.sh /usr/local/bin/shutdown-script.sh

RUN chmod +x /usr/local/bin/shutdown-script.sh

ENTRYPOINT ["/usr/local/bin/shutdown-script.sh"]
  • shutdown-script.sh:
#!/bin/sh

# Trap the SIGTERM signal
trap 'echo "Received SIGTERM. Delaying shutdown..."; for i in $(seq 5 -1 1); do echo "Waiting to shutdown... $i minute(s)"; sleep 60; done; exit 0' SIGTERM

# Keep the container running indefinitely
while :; do
  echo "Running..."
  sleep 10
done
  • deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-shutdown
  labels:
      app.kubernetes.io/name: slow-shutdown
      app.kubernetes.io/component: master
spec:
  replicas: 1
  selector:
      matchLabels:
          app.kubernetes.io/name: slow-shutdown
          app.kubernetes.io/component: master
  template:
    metadata:
      labels:
        app.kubernetes.io/name: slow-shutdown
        app.kubernetes.io/component: master
    spec:
      containers:
      - name: slow-shutdown
        image: slow_shutdown:1.0.0
      terminationGracePeriodSeconds: 30000
  • create a docker image from the dockerfile:
    docker build -t slow_shutdown:1.0.0 .
  • make sure the image is accessible to k8s. If using kind:
    kind load docker-image slow_shutdown:1.0.0 --name <cluster-name>
  • create deployment:
    kubectl create -n <namespace> -f deployment.yaml
  • watch logs: kubectl logs -n default slow-shutdown-867b6886dd-jt6tn -f | bunyan
Running...
Running...
Running...
Running...
  • delete deployment:
    kubectl delete -n default deployment.apps/slow-shutdown
    logs:
Running...
Running...
Running...
Running...
Received SIGTERM. Delaying shutdown...
Waiting to shutdown... 5 minute(s)
  • force delete pod:
    kubectl delete -n default pod/slow-shutdown-867b6886dd-jt6tn --grace-period=0 --force=true
  • You should no longer see the pod with kubectl get pods -n default, but the logs should stay open for 5 minutes from the original delete:
Waiting to shutdown... 5 minute(s)
Waiting to shutdown... 4 minute(s)
Waiting to shutdown... 3 minute(s)
Waiting to shutdown... 2 minute(s)
Waiting to shutdown... 1 minute(s)


busma13 commented Oct 7, 2024

This should be resolved. Closing.

busma13 closed this as completed Oct 7, 2024