
STS pod stuck pending until deleted #3420

Closed
Timer opened this issue Feb 18, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@Timer

Timer commented Feb 18, 2023

Version

Karpenter Version: v0.22.1

Kubernetes Version: v1.22.16-eks-ffeb93d

Expected Behavior

Karpenter should provision a new node so that the pod is able to be scheduled.

Actual Behavior

Karpenter is ignoring the pod altogether. There are no Karpenter events associated with the pod, and restarting Karpenter doesn't help; the pod must be deleted.

Steps to Reproduce the Problem

I haven't been able to narrow down an exact reproduction yet, but it seems to affect only STS pods, and we've hit it a half dozen or so times over the past 2-3 weeks. I'm hoping the logs provided point to the smoking gun.

The only thing that stood out to me is that the pod had nominatedNodeName: ip-10-0-45-138.ap-northeast-3.compute.internal in its status, but that node no longer existed.

Together with issue #1051, that seemed like the most likely cause.
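
For reference, here is a minimal client-go sketch of the check I ran by hand (illustrative only, not part of the original report; the file name and flow are my own): list Pending pods and flag any whose status.nominatedNodeName points at a node that no longer exists.

// stalenomination.go: flag Pending pods whose nominatedNodeName references a
// node that no longer exists. Illustrative diagnostic, not Karpenter code.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Build the set of node names that currently exist in the cluster.
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	existing := map[string]bool{}
	for _, n := range nodes.Items {
		existing[n.Name] = true
	}

	// Flag Pending pods whose nominated node is gone.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		nn := p.Status.NominatedNodeName
		if p.Status.Phase == corev1.PodPending && nn != "" && !existing[nn] {
			fmt.Printf("%s/%s is nominated to missing node %s\n", p.Namespace, p.Name, nn)
		}
	}
}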

Resource Specs and Logs

The stuck pod was 27h old at the time this was grabbed:

Pod events:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  37m (x38 over 80m)     default-scheduler  0/59 nodes are available: 2 Insufficient memory, 3 Insufficient cpu, 56 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  28m                    default-scheduler  0/59 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient memory, 3 Insufficient cpu, 55 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  13m                    default-scheduler  0/44 nodes are available: 2 Insufficient memory, 3 Insufficient cpu, 41 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  8m40s (x64 over 26h)   default-scheduler  (combined from similar events): 0/39 nodes are available: 2 Insufficient memory, 3 Insufficient cpu, 36 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  2m40s (x2 over 4m10s)  default-scheduler  0/37 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient memory, 3 Insufficient cpu, 33 node(s) didn't match Pod's node affinity/selector.

Pod conditions / status:

Status:               Pending
Conditions:
  Type           Status
  PodScheduled   False 

Long form status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-02-17T12:51:34Z"
    message: '0/35 nodes are available: 2 Insufficient memory, 3 Insufficient cpu,
      32 node(s) didn''t match Pod''s node affinity/selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  nominatedNodeName: ip-10-0-45-138.ap-northeast-3.compute.internal
  phase: Pending
  qosClass: Guaranteed

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Timer added the bug label Feb 18, 2023
@ellistarn
Contributor

Can you show the pod metadata? It might be stuck deleting.

@Timer
Author

Timer commented Feb 20, 2023

Here's the metadata from another pod currently stuck pending (minor redactions for env values):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2023-02-14T18:15:43Z"
  generateName: certificates-production-
  labels:
    app: certificates-production
    controller-revision-hash: certificates-production-858848d7df
    statefulset.kubernetes.io/pod-name: certificates-production-1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:ad.datadoghq.com/certificates.check_names: {}
          f:ad.datadoghq.com/certificates.init_configs: {}
          f:ad.datadoghq.com/certificates.instances: {}
          f:ad.datadoghq.com/certificates.logs: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
          f:controller-revision-hash: {}
          f:statefulset.kubernetes.io/pod-name: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"6d2690e4-9bc0-4baa-b1ca-38fa95052950"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:affinity:
          .: {}
          f:podAntiAffinity:
            .: {}
            f:requiredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"certificates"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":1338,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
              k:{"containerPort":1339,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
              k:{"containerPort":8080,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/etc/cert-encryption"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/cert-cache"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/cert-cache-old"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:hostname: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/arch: {}
          f:kubernetes.io/os: {}
          f:type: {}
        f:priorityClassName: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:subdomain: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"cert-encryption"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:secretName: {}
          k:{"name":"certificates-cache"}:
            .: {}
            f:name: {}
            f:persistentVolumeClaim:
              .: {}
              f:claimName: {}
          k:{"name":"certificates-cache-xfs"}:
            .: {}
            f:name: {}
            f:persistentVolumeClaim:
              .: {}
              f:claimName: {}
    manager: kube-controller-manager
    operation: Update
    time: "2023-02-14T18:15:43Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .: {}
          k:{"type":"PodScheduled"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:nominatedNodeName: {}
    manager: kube-scheduler
    operation: Update
    time: "2023-02-14T18:15:43Z"
  name: certificates-production-1
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: certificates-production
    uid: 6d2690e4-9bc0-4baa-b1ca-38fa95052950
  resourceVersion: "409307020"
  uid: 085f40e4-7cbd-4afa-9f47-06dba6cc68ce
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-02-14T18:15:43Z"
    message: '0/53 nodes are available: 2 node(s) didn''t match pod affinity/anti-affinity
      rules, 2 node(s) didn''t match pod anti-affinity rules, 51 node(s) didn''t match
      Pod''s node affinity/selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  nominatedNodeName: ip-10-0-181-5.ap-northeast-1.compute.internal
  phase: Pending
  qosClass: Guaranteed

The node ip-10-0-181-5.ap-northeast-1.compute.internal set in nominatedNodeName no longer exists.

@Timer
Author

Timer commented Feb 20, 2023

Some additional context:

Karpenter consolidation is enabled in this cluster, and this pod was running at some point before it got stuck Pending. There seems to be an awkward interplay between this STS (which has a higher priority class than other services), Karpenter consolidation, and the service's podAntiAffinity, which forbids replicas from running on the same node.

In the stuck state, there are only 2 nodes the STS is allowed to run on (via its node selector), but a target of 3 replicas. So during consolidation Karpenter got rid of one of the essential nodes (probably the high pod priority evicting other workloads from a node Karpenter chose for deprovisioning).

@Timer
Author

Timer commented Feb 20, 2023

Browsing through the karpenter-core code seems to indicate this could happen in a multi-pass consolidation scenario.

Since simulateScheduling in pkg/controllers/deprovisioning/helpers.go uses provisioner.GetPendingPods (which ignores pods with a nominatedNodeName set), I think the reproduction would be the following (a hedged sketch of that exclusion follows the list):

  1. Start with a cluster in steady state: an STS with 3 replicas and podAntiAffinity forbidding co-location. There are four non-homogeneous nodes in the cluster because other deployment workloads exist.
  2. An unrelated deployment workload is rolled with maxSurge: 1, maxUnavailable: 0, leaving the final pod placements a bit different.
  3. Karpenter decides it can consolidate a node (due to the moved pods).
  4. Karpenter begins deprovisioning a node that contains a replica of the STS.
  5. The Kubernetes scheduler nominates one of the other nodes (3 remaining that are Ready, 1 cordoned/draining) to bring the STS workload with its high priority class back online.
  6. The STS pod does not initialize quickly enough, and a second pass of Karpenter consolidation thinks it can consolidate again, since the pod was excluded from the provisioner.GetPendingPods results.
  7. Karpenter drains the node that the previously interrupted STS pod was nominated to move to, and we're left in the state reported in the OP.
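
To make step 6 concrete, the exclusion being described behaves roughly like the sketch below (function name and structure are illustrative, not copied from karpenter-core): a pod with a non-empty nominatedNodeName is treated as "about to schedule" and dropped from the pending set, even if the nominated node has since been deprovisioned.

package sketch

import corev1 "k8s.io/api/core/v1"

// pendingForProvisioning is illustrative only: roughly how a pending-pod
// filter that trusts nominatedNodeName would drop the interrupted STS pod.
func pendingForProvisioning(pods []corev1.Pod) []corev1.Pod {
	var out []corev1.Pod
	for _, p := range pods {
		if p.Status.Phase != corev1.PodPending {
			continue
		}
		// Trusting the scheduler: a non-empty nominated node is read as
		// "this pod will schedule soon", so the pod is skipped even if
		// that node no longer exists.
		if p.Status.NominatedNodeName != "" {
			continue
		}
		out = append(out, p)
	}
	return out
}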

Looking at pod & node ages, this theory seems plausible:

(screenshot: pod and node ages)

I've seen these STS pods sometimes take several minutes to initialize because the EBS CSI driver struggles to keep up with node rotations (a ~4-5 minute delay, i.e. longer than it takes a new node to become Ready).


There's probably a Kubernetes-specific bug report to file for not clearing nominatedNodeName when the node is removed from the cluster, but Karpenter could also be defensive against this scenario.
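
One defensive option, sketched under the same assumptions (names are illustrative, not existing karpenter-core API): only honor nominatedNodeName when the nominated node still exists; otherwise treat the pod as pending so it can trigger provisioning again.

package sketch

import corev1 "k8s.io/api/core/v1"

// nominationStillValid is a defensive variant of the filter above: a
// nominated node is only trusted if it still exists in the cluster.
// knownNodes is the set of node names currently present.
func nominationStillValid(p corev1.Pod, knownNodes map[string]bool) bool {
	nn := p.Status.NominatedNodeName
	if nn == "" {
		// No nomination at all; the pod is plainly pending.
		return false
	}
	// A stale nomination (node already deprovisioned) returns false here,
	// so the caller falls back to treating the pod as pending instead of
	// waiting on the scheduler forever.
	return knownNodes[nn]
}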

@ellistarn
Contributor

@tzneal is deeper on this than me.

@Timer
Author

Timer commented Feb 20, 2023

Thanks for giving this all a read! I'll continue to try to come up with a more concrete reproduction if y'all think it'd be helpful.

@Timer
Author

Timer commented Feb 20, 2023

FWIW, it seems like the invalid nominatedNodeName was a Kubernetes bug, kubernetes/kubernetes#85677. It was fixed in 1.24 and backported to 1.23 (kubernetes/kubernetes#106816), but our clusters are on 1.21/1.22.

@tzneal
Contributor

tzneal commented Feb 20, 2023

Karpenter is ignoring the pod since it has a nominated node name, so we expect it to schedule. We trust the scheduler to be correct, and in this case it isn't. If you can upgrade to v1.23, that should resolve it.

I'll look into it a bit on the Karpenter side.

@Timer
Author

Timer commented Feb 20, 2023

Would it be reasonable to have consolidation ignore the presence of nominatedNodeName? It feels like all pending-ish pods should prevent consolidation from occurring, but perhaps that's another discussion.

While unrelated to the OP issue, we've also seen Karpenter consolidate and then have to re-provision nodes just-in-time for pods stuck in CrashLoopBackoff.

It feels reasonable for "to be scheduled" pods to be considered when deciding whether to consolidate; that would solve the OP issue and the node thrashing due to CrashLoopBackoff.


Should I test whether Kubernetes can get itself unstuck by manually adding a node (without manually clearing nominatedNodeName or deleting the pod)?

@tzneal
Contributor

tzneal commented Feb 20, 2023

I looked at a few options for solving this, but none of them are great; they open up possibilities for other problems (either under-provisioning or over-provisioning). Just ignoring the nominated node name when provisioning can cause over-provisioning during pod evictions.

We don't treat CrashLoopBackoff pods any differently that I'm aware of. If you run into issues with this, file another issue and include Karpenter logs.

@Timer
Author

Timer commented Feb 22, 2023

Thanks for all the time spent looking at this.

I'll close this since we've narrowed the issue down to a Kubernetes bug (and there's no great way for Karpenter to work around it).

@Timer closed this as completed Feb 22, 2023