
STS pod stuck pending until deleted #3420

Closed
Timer opened this issue Feb 18, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@Timer

Timer commented Feb 18, 2023

Version

Karpenter Version: v0.22.1

Kubernetes Version: v1.22.16-eks-ffeb93d

Expected Behavior

Karpenter should provision a new node so that the pod is able to be scheduled.

Actual Behavior

Karpenter is ignoring the pod altogether. There are no Karpenter events associated with the pod, and restarting Karpenter doesn't help; the pod must be deleted.

Steps to Reproduce the Problem

I haven't been able to narrow down an exact reproduction yet, but it seems to affect only STS pods, and we've hit it a half dozen or so times over the past 2-3 weeks. I'm hoping the logs provided point to the smoking gun.

The only thing that stood out to me is that the pod had nominatedNodeName: ip-10-0-45-138.ap-northeast-3.compute.internal in its status, but that node no longer existed.

Together with issue #1051, that seemed like the most likely cause.
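
For reference, here is a minimal client-go sketch of the check I ran by hand (illustrative only, not part of the original report; the file name and flow are my own): list Pending pods and flag any whose status.nominatedNodeName points at a node that no longer exists.

// stalenomination.go: flag Pending pods whose nominatedNodeName references a
// node that no longer exists. Illustrative diagnostic, not Karpenter code.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Build the set of node names that currently exist in the cluster.
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	existing := map[string]bool{}
	for _, n := range nodes.Items {
		existing[n.Name] = true
	}

	// Flag Pending pods whose nominated node is gone.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		nn := p.Status.NominatedNodeName
		if p.Status.Phase == corev1.PodPending && nn != "" && !existing[nn] {
			fmt.Printf("%s/%s is nominated to missing node %s\n", p.Namespace, p.Name, nn)
		}
	}
}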

Resource Specs and Logs

The stuck pod was 27h old at the time this was grabbed:

Pod events:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  37m (x38 over 80m)     default-scheduler  0/59 nodes are available: 2 Insufficient memory, 3 Insufficient cpu, 56 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  28m                    default-scheduler  0/59 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient memory, 3 Insufficient cpu, 55 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  13m                    default-scheduler  0/44 nodes are available: 2 Insufficient memory, 3 Insufficient cpu, 41 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  8m40s (x64 over 26h)   default-scheduler  (combined from similar events): 0/39 nodes are available: 2 Insufficient memory, 3 Insufficient cpu, 36 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  2m40s (x2 over 4m10s)  default-scheduler  0/37 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient memory, 3 Insufficient cpu, 33 node(s) didn't match Pod's node affinity/selector.

Pod conditions / status:

Status:               Pending
Conditions:
  Type           Status
  PodScheduled   False 

Long form status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-02-17T12:51:34Z"
    message: '0/35 nodes are available: 2 Insufficient memory, 3 Insufficient cpu,
      32 node(s) didn''t match Pod''s node affinity/selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  nominatedNodeName: ip-10-0-45-138.ap-northeast-3.compute.internal
  phase: Pending
  qosClass: Guaranteed

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Timer added the bug label Feb 18, 2023
@ellistarn
Contributor

Can you show the pod metadata? It might be stuck deleting.

@Timer
Author

Timer commented Feb 20, 2023

Here's the metadata from another pod currently stuck pending (minor redactions for env values):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2023-02-14T18:15:43Z"
  generateName: certificates-production-
  labels:
    app: certificates-production
    controller-revision-hash: certificates-production-858848d7df
    statefulset.kubernetes.io/pod-name: certificates-production-1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:ad.datadoghq.com/certificates.check_names: {}
          f:ad.datadoghq.com/certificates.init_configs: {}
          f:ad.datadoghq.com/certificates.instances: {}
          f:ad.datadoghq.com/certificates.logs: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
          f:controller-revision-hash: {}
          f:statefulset.kubernetes.io/pod-name: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"6d2690e4-9bc0-4baa-b1ca-38fa95052950"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:affinity:
          .: {}
          f:podAntiAffinity:
            .: {}
            f:requiredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"certificates"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":1338,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
              k:{"containerPort":1339,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
              k:{"containerPort":8080,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/etc/cert-encryption"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/cert-cache"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/cert-cache-old"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:hostname: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/arch: {}
          f:kubernetes.io/os: {}
          f:type: {}
        f:priorityClassName: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:subdomain: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"cert-encryption"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:secretName: {}
          k:{"name":"certificates-cache"}:
            .: {}
            f:name: {}
            f:persistentVolumeClaim:
              .: {}
              f:claimName: {}
          k:{"name":"certificates-cache-xfs"}:
            .: {}
            f:name: {}
            f:persistentVolumeClaim:
              .: {}
              f:claimName: {}
    manager: kube-controller-manager
    operation: Update
    time: "2023-02-14T18:15:43Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .: {}
          k:{"type":"PodScheduled"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:nominatedNodeName: {}
    manager: kube-scheduler
    operation: Update
    time: "2023-02-14T18:15:43Z"
  name: certificates-production-1
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: certificates-production
    uid: 6d2690e4-9bc0-4baa-b1ca-38fa95052950
  resourceVersion: "409307020"
  uid: 085f40e4-7cbd-4afa-9f47-06dba6cc68ce
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-02-14T18:15:43Z"
    message: '0/53 nodes are available: 2 node(s) didn''t match pod affinity/anti-affinity
      rules, 2 node(s) didn''t match pod anti-affinity rules, 51 node(s) didn''t match
      Pod''s node affinity/selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  nominatedNodeName: ip-10-0-181-5.ap-northeast-1.compute.internal
  phase: Pending
  qosClass: Guaranteed

The node ip-10-0-181-5.ap-northeast-1.compute.internal set in nominatedNodeName no longer exists.

@Timer
Author

Timer commented Feb 20, 2023

Some additional context:

Karpenter consolidation is enabled in this cluster, and this pod was running at some point before it got stuck Pending. There seems to be an awkward interplay between this STS (which has a higher priority class than other services), Karpenter consolidation, and the service's podAntiAffinity, which forbids replicas from running on the same node.

In the stuck state, there are only 2 nodes the STS is allowed to run on (via its node selector), but a target of 3 replicas. So during consolidation Karpenter got rid of one of the essential nodes (probably the high pod priority evicting other workloads from a node Karpenter chose for deprovisioning).

@Timer
Author

Timer commented Feb 20, 2023

Browsing through the karpenter-core code seems to indicate this could happen in a multi-pass consolidation scenario.

Since simulateScheduling in pkg/controllers/deprovisioning/helpers.go uses provisioner.GetPendingPods (which ignores pods with a nominatedNodeName set), I think the reproduction would be the following (a hedged sketch of that exclusion follows the list):

  1. Start with a cluster in steady state: an STS with 3 replicas and podAntiAffinity forbidding co-location. There are four non-homogeneous nodes in the cluster because other deployment workloads exist.
  2. An unrelated deployment workload is rolled with maxSurge: 1, maxUnavailable: 0, leaving the final pod placements a bit different.
  3. Karpenter decides it can consolidate a node (due to the moved pods).
  4. Karpenter begins deprovisioning a node that contains a replica of the STS.
  5. The Kubernetes scheduler nominates one of the other nodes (3 remaining that are Ready, 1 cordoned/draining) to bring the STS workload with its high priority class back online.
  6. The STS pod does not initialize quickly enough, and a second pass of Karpenter consolidation thinks it can consolidate again, since the pod was excluded from the provisioner.GetPendingPods results.
  7. Karpenter drains the node that the previously interrupted STS pod was nominated to move to, and we're left in the state reported in the OP.
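
To make step 6 concrete, the exclusion being described behaves roughly like the sketch below (function name and structure are illustrative, not copied from karpenter-core): a pod with a non-empty nominatedNodeName is treated as "about to schedule" and dropped from the pending set, even if the nominated node has since been deprovisioned.

package sketch

import corev1 "k8s.io/api/core/v1"

// pendingForProvisioning is illustrative only: roughly how a pending-pod
// filter that trusts nominatedNodeName would drop the interrupted STS pod.
func pendingForProvisioning(pods []corev1.Pod) []corev1.Pod {
	var out []corev1.Pod
	for _, p := range pods {
		if p.Status.Phase != corev1.PodPending {
			continue
		}
		// Trusting the scheduler: a non-empty nominated node is read as
		// "this pod will schedule soon", so the pod is skipped even if
		// that node no longer exists.
		if p.Status.NominatedNodeName != "" {
			continue
		}
		out = append(out, p)
	}
	return out
}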

Looking at pod & node ages, this theory seems plausible:

(screenshot: pod and node ages)

I've seen these STS pods sometimes take several minutes to initialize because the EBS CSI driver struggles to keep up with node rotations (a ~4-5 minute delay, i.e. longer than it takes a new node to become Ready).


There's probably a Kubernetes-specific bug report to file for not clearing nominatedNodeName when the node is removed from the cluster, but Karpenter could also be defensive against this scenario.
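
One defensive option, sketched under the same assumptions (names are illustrative, not existing karpenter-core API): only honor nominatedNodeName when the nominated node still exists; otherwise treat the pod as pending so it can trigger provisioning again.

package sketch

import corev1 "k8s.io/api/core/v1"

// nominationStillValid is a defensive variant of the filter above: a
// nominated node is only trusted if it still exists in the cluster.
// knownNodes is the set of node names currently present.
func nominationStillValid(p corev1.Pod, knownNodes map[string]bool) bool {
	nn := p.Status.NominatedNodeName
	if nn == "" {
		// No nomination at all; the pod is plainly pending.
		return false
	}
	// A stale nomination (node already deprovisioned) returns false here,
	// so the caller falls back to treating the pod as pending instead of
	// waiting on the scheduler forever.
	return knownNodes[nn]
}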

@ellistarn
Contributor

@tzneal is deeper on this than me.

@Timer
Author

Timer commented Feb 20, 2023

Thanks for giving this all a read! I'll continue to try to come up with a more concrete reproduction if y'all think it'd be helpful.

@Timer
Author

Timer commented Feb 20, 2023

FWIW, it seems like the invalid nominatedNodeName was a Kubernetes bug, kubernetes/kubernetes#85677. It was fixed in 1.24 and backported to 1.23 (kubernetes/kubernetes#106816), but our clusters are on 1.21/1.22.

@tzneal
Contributor

tzneal commented Feb 20, 2023

Karpenter is ignoring the pod since it has a nominated node name, so we expect it to schedule. We trust the scheduler to be correct, and in this case it isn't. If you can upgrade to v1.23, that should resolve it.

I'll look into it a bit on the Karpenter side.

@Timer
Author

Timer commented Feb 20, 2023

Would it be reasonable to have consolidation ignore the presence of nominatedNodeName? It feels like all pending-ish pods should prevent consolidation from occurring, but perhaps that's another discussion.

While unrelated to the OP issue, we've also seen Karpenter consolidate and then have to re-provision nodes just-in-time for pods stuck in CrashLoopBackoff.

It feels reasonable for "to be scheduled" pods to be considered when deciding whether to consolidate; that would solve the OP issue and the node thrashing due to CrashLoopBackoff.


Should I test whether Kubernetes can get itself unstuck by manually adding a node (without manually clearing nominatedNodeName or deleting the pod)?

@tzneal
Contributor

tzneal commented Feb 20, 2023

I looked at a few options for solving this, but none of them are great; they open up possibilities for other problems (either under-provisioning or over-provisioning). Just ignoring the nominated node name when provisioning can cause over-provisioning during pod evictions.

We don't treat CrashLoopBackoff pods any differently that I'm aware of. If you run into issues with this, file another issue and include Karpenter logs.

@Timer
Author

Timer commented Feb 22, 2023

Thanks for all the time spent looking at this.

I'll close this since we've narrowed the issue down to a Kubernetes bug (and there's no great way for Karpenter to work around it).

@Timer closed this as completed Feb 22, 2023