
Pod StartError seems to be ignored #4011

Closed · alexec opened this issue Sep 13, 2020 · 3 comments

@alexec (Contributor) commented Sep 13, 2020

Summary

The pod failed to start, so the workflow should have errored, but it remained Running.

Diagnostics

What version of Argo Workflows are you running? master
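
For context, here is a rough sketch of the Workflow that produced the failing node, reconstructed from the node status and the workflows.argoproj.io/template annotation in the pod manifest below. It is not the exact manifest: per templateScope the templates actually live in a namespaced WorkflowTemplate called loops-sequence (inlined here for brevity), and the sequence count is a guess.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-sequence-
  namespace: argo
spec:
  entrypoint: loops-sequence
  templates:
    - name: loops-sequence
      steps:
        - - name: sequence-count
            template: echo
            arguments:
              parameters:
                - name: msg
                  value: "{{item}}"
            withSequence:
              count: "4"   # guess; the failing node is item (3:3)
    - name: echo
      inputs:
        parameters:
          - name: msg
      container:
        image: alpine:latest
        command: [echo, "{{inputs.parameters.msg}}"]

The controller's view of the affected node, followed by the full pod: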

    loops-sequence-vvb9d-4095976472:
      id: loops-sequence-vvb9d-4095976472
      name: 'loops-sequence-vvb9d[0].sequence-count(3:3)'
      displayName: 'sequence-count(3:3)'
      type: Pod
      templateName: echo
      templateScope: namespaced/loops-sequence
      phase: Running
      boundaryID: loops-sequence-vvb9d
      startedAt: '2020-09-13T21:15:01Z'
      finishedAt: null
      estimatedDuration: 19000000000
      inputs:
        parameters:
          - name: msg
            value: '3'
      hostNodeName: k3d-k3s-default-server
apiVersion: v1
kind: Pod
metadata:
  annotations:
    workflows.argoproj.io/node-name: loops-sequence-vvb9d[0].sequence-count(3:3)
    workflows.argoproj.io/template: '{"name":"echo","arguments":{},"inputs":{"parameters":[{"name":"msg","value":"3"}]},"outputs":{},"metadata":{},"container":{"name":"","image":"alpine:latest","command":["echo","3"],"resources":{}},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"loops-sequence-vvb9d/loops-sequence-vvb9d-4095976472"}}}'
  labels:
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/workflow: loops-sequence-vvb9d
  name: loops-sequence-vvb9d-4095976472
  namespace: argo
  ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: Workflow
      name: loops-sequence-vvb9d
      uid: 0d676e02-8145-4b5c-a3a6-dcadb76a7841
spec:
  containers:
    - command:
        - argoexec
        - wait
      env:
        - name: ARGO_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: ARGO_CONTAINER_RUNTIME_EXECUTOR
          value: pns
      image: argoproj/argoexec:latest
      imagePullPolicy: IfNotPresent
      name: wait
      resources:
        limits:
          cpu: 500m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 64Mi
      securityContext:
        capabilities:
          add:
            - SYS_PTRACE
            - SYS_CHROOT
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /argo/podmetadata
          name: podmetadata
        - mountPath: /argo/secret/my-minio-cred
          name: my-minio-cred
          readOnly: true
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-tf5qr
          readOnly: true
    - command:
        - echo
        - "3"
      image: alpine:latest
      imagePullPolicy: Always
      name: main
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-tf5qr
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k3d-k3s-default-server
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  shareProcessNamespace: true
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - downwardAPI:
        defaultMode: 420
        items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations
            path: annotations
      name: podmetadata
    - name: my-minio-cred
      secret:
        defaultMode: 420
        items:
          - key: accesskey
            path: accesskey
          - key: secretkey
            path: secretkey
        secretName: my-minio-cred
    - name: default-token-tf5qr
      secret:
        defaultMode: 420
        secretName: default-token-tf5qr
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      message: 'containers with unready status: [main]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      message: 'containers with unready status: [main]'
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      status: "True"
      type: PodScheduled
  containerStatuses:
    - containerID: containerd://8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76
      image: docker.io/library/alpine:latest
      imageID: docker.io/library/alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
      lastState: {}
      name: main
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76
          exitCode: 128
          finishedAt: "2020-09-13T21:15:45Z"
          message: 'failed to create containerd task: failed to start io pipe copy:
            unable to copy pipes: containerd-shim: opening w/o fifo "/run/k3s/containerd/io.containerd.grpc.v1.cri/containers/8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76/io/236879711/8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76-stdout"
            failed: context deadline exceeded'
          reason: StartError
          startedAt: "1970-01-01T00:00:00Z"
    - containerID: containerd://35a12933d32e2150e5c62940f19a7b9cc57b1fbcdc11a343fe2f0f7b306694d0
      image: docker.io/argoproj/argoexec:latest
      imageID: sha256:76c472387dfe8d5cb8126b494dbe90ae9c59b9389db2761bf75db0a2d60cfbae
      lastState: {}
      name: wait
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2020-09-13T21:15:06Z"
  hostIP: 172.18.0.2
  phase: Running
  podIP: 10.42.0.204
  podIPs:
    - ip: 10.42.0.204
  qosClass: Burstable
  startTime: "2020-09-13T21:15:01Z"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec (Contributor, Author) commented Sep 13, 2020

A possible fix, at operator.go#1073:

				log.Infof("Processing ready daemon pod: %v", pod.ObjectMeta.SelfLink)
			}

			// If any init or main container has terminated with a nonzero exit code
			// (e.g. the StartError above, exit code 128), infer a failed phase and
			// message for the node instead of leaving it Running.
			for _, s := range append(pod.Status.InitContainerStatuses, pod.Status.ContainerStatuses...) {
				t := s.State.Terminated
				if t != nil && t.ExitCode > 0 {
					newPhase, message = inferFailedReason(pod)
				}
			}

alexec self-assigned this Sep 14, 2020
alexec added a commit to alexec/argo-workflows that referenced this issue Sep 14, 2020
alexec added the wontfix label Sep 14, 2020
alexec removed their assignment Sep 14, 2020
stale bot removed the wontfix label Sep 14, 2020
@alexec (Contributor, Author) commented Sep 14, 2020

This has not been seen in the wild; it may be a rare, K3s-only issue. A fix might create new bugs.

alexec closed this as completed Sep 14, 2020
@doubliez commented Feb 6, 2025

I'm running into this exact issue on a production EKS cluster. Sometimes, a pod within a workflow will fail with StartError and a low-level OCI runtime error:

Warning  Failed     46m   kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to freeze: unknown

When this happens, the workflow in Argo remains stuck in the Running state indefinitely, which is a big problem. I'd like Argo to automatically catch this error and retry according to the retry strategy defined in the workflow.

Granted, this error shouldn't happen in the first place, and it seems related to a specific AMI or kernel version (see aws/karpenter-provider-aws#7510), but when it does happen, it's problematic that there's no way to handle it in Argo.
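
For reference, this is roughly the retry configuration I'd want to kick in; a minimal sketch (the workflow and template names are placeholders, and it assumes the controller would surface the StartError as a node error at all, which is exactly what doesn't happen today):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-start-error-   # placeholder name
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"
        # OnError retries nodes that end in the Error phase (system-level problems
        # such as a pod that never started), not just application failures.
        retryPolicy: OnError
      container:
        image: alpine:latest
        command: [echo, hello]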
