
Pod StartError seems to be ignored #4011

Closed · alexec opened this issue Sep 13, 2020 · 3 comments

@alexec (Contributor) commented Sep 13, 2020

Summary

The pod failed to start, so the workflow should have errored, but it remained Running.

Diagnostics

What version of Argo Workflows are you running? master
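
For context, here is a rough sketch of the Workflow that produced the failing node, reconstructed from the node status and the workflows.argoproj.io/template annotation in the pod manifest below. It is not the exact manifest: per templateScope the templates actually live in a namespaced WorkflowTemplate called loops-sequence (inlined here for brevity), and the sequence count is a guess.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-sequence-
  namespace: argo
spec:
  entrypoint: loops-sequence
  templates:
    - name: loops-sequence
      steps:
        - - name: sequence-count
            template: echo
            arguments:
              parameters:
                - name: msg
                  value: "{{item}}"
            withSequence:
              count: "4"   # guess; the failing node is item (3:3)
    - name: echo
      inputs:
        parameters:
          - name: msg
      container:
        image: alpine:latest
        command: [echo, "{{inputs.parameters.msg}}"]

The controller's view of the affected node, followed by the full pod: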

    loops-sequence-vvb9d-4095976472:
      id: loops-sequence-vvb9d-4095976472
      name: 'loops-sequence-vvb9d[0].sequence-count(3:3)'
      displayName: 'sequence-count(3:3)'
      type: Pod
      templateName: echo
      templateScope: namespaced/loops-sequence
      phase: Running
      boundaryID: loops-sequence-vvb9d
      startedAt: '2020-09-13T21:15:01Z'
      finishedAt: null
      estimatedDuration: 19000000000
      inputs:
        parameters:
          - name: msg
            value: '3'
      hostNodeName: k3d-k3s-default-server
apiVersion: v1
kind: Pod
metadata:
  annotations:
    workflows.argoproj.io/node-name: loops-sequence-vvb9d[0].sequence-count(3:3)
    workflows.argoproj.io/template: '{"name":"echo","arguments":{},"inputs":{"parameters":[{"name":"msg","value":"3"}]},"outputs":{},"metadata":{},"container":{"name":"","image":"alpine:latest","command":["echo","3"],"resources":{}},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"loops-sequence-vvb9d/loops-sequence-vvb9d-4095976472"}}}'
  labels:
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/workflow: loops-sequence-vvb9d
  name: loops-sequence-vvb9d-4095976472
  namespace: argo
  ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: Workflow
      name: loops-sequence-vvb9d
      uid: 0d676e02-8145-4b5c-a3a6-dcadb76a7841
spec:
  containers:
    - command:
        - argoexec
        - wait
      env:
        - name: ARGO_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: ARGO_CONTAINER_RUNTIME_EXECUTOR
          value: pns
      image: argoproj/argoexec:latest
      imagePullPolicy: IfNotPresent
      name: wait
      resources:
        limits:
          cpu: 500m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 64Mi
      securityContext:
        capabilities:
          add:
            - SYS_PTRACE
            - SYS_CHROOT
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /argo/podmetadata
          name: podmetadata
        - mountPath: /argo/secret/my-minio-cred
          name: my-minio-cred
          readOnly: true
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-tf5qr
          readOnly: true
    - command:
        - echo
        - "3"
      image: alpine:latest
      imagePullPolicy: Always
      name: main
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-tf5qr
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k3d-k3s-default-server
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  shareProcessNamespace: true
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - downwardAPI:
        defaultMode: 420
        items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations
            path: annotations
      name: podmetadata
    - name: my-minio-cred
      secret:
        defaultMode: 420
        items:
          - key: accesskey
            path: accesskey
          - key: secretkey
            path: secretkey
        secretName: my-minio-cred
    - name: default-token-tf5qr
      secret:
        defaultMode: 420
        secretName: default-token-tf5qr
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      message: 'containers with unready status: [main]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      message: 'containers with unready status: [main]'
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      status: "True"
      type: PodScheduled
  containerStatuses:
    - containerID: containerd://8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76
      image: docker.io/library/alpine:latest
      imageID: docker.io/library/alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
      lastState: {}
      name: main
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76
          exitCode: 128
          finishedAt: "2020-09-13T21:15:45Z"
          message: 'failed to create containerd task: failed to start io pipe copy:
            unable to copy pipes: containerd-shim: opening w/o fifo "/run/k3s/containerd/io.containerd.grpc.v1.cri/containers/8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76/io/236879711/8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76-stdout"
            failed: context deadline exceeded'
          reason: StartError
          startedAt: "1970-01-01T00:00:00Z"
    - containerID: containerd://35a12933d32e2150e5c62940f19a7b9cc57b1fbcdc11a343fe2f0f7b306694d0
      image: docker.io/argoproj/argoexec:latest
      imageID: sha256:76c472387dfe8d5cb8126b494dbe90ae9c59b9389db2761bf75db0a2d60cfbae
      lastState: {}
      name: wait
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2020-09-13T21:15:06Z"
  hostIP: 172.18.0.2
  phase: Running
  podIP: 10.42.0.204
  podIPs:
    - ip: 10.42.0.204
  qosClass: Burstable
  startTime: "2020-09-13T21:15:01Z"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec (Contributor, Author) commented Sep 13, 2020

A possible fix, at operator.go#1073:

				log.Infof("Processing ready daemon pod: %v", pod.ObjectMeta.SelfLink)
			}

			// If any init or main container has terminated with a nonzero exit code
			// (e.g. the StartError above, exit code 128), infer a failed phase and
			// message for the node instead of leaving it Running.
			for _, s := range append(pod.Status.InitContainerStatuses, pod.Status.ContainerStatuses...) {
				t := s.State.Terminated
				if t != nil && t.ExitCode > 0 {
					newPhase, message = inferFailedReason(pod)
				}
			}

alexec self-assigned this Sep 14, 2020
alexec added a commit to alexec/argo-workflows that referenced this issue Sep 14, 2020
alexec added the wontfix label Sep 14, 2020
alexec removed their assignment Sep 14, 2020
stale bot removed the wontfix label Sep 14, 2020
@alexec (Contributor, Author) commented Sep 14, 2020

This has not been seen in the wild; it may be a rare, K3s-only issue. A fix might create new bugs.

alexec closed this as completed Sep 14, 2020
@doubliez commented Feb 6, 2025

I'm running into this exact issue on a production EKS cluster. Sometimes, a pod within a workflow will fail with StartError and a low-level OCI runtime error:

Warning  Failed     46m   kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to freeze: unknown

When this happens, the workflow in Argo remains stuck in the Running state indefinitely, which is a big problem. I'd like Argo to automatically catch this error and retry according to the retry strategy defined in the workflow.

Granted, this error shouldn't happen in the first place, and it seems related to a specific AMI or kernel version (see aws/karpenter-provider-aws#7510), but when it does happen, it's problematic that there's no way to handle it in Argo.
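
For reference, this is roughly the retry configuration I'd want to kick in; a minimal sketch (the workflow and template names are placeholders, and it assumes the controller would surface the StartError as a node error at all, which is exactly what doesn't happen today):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-start-error-   # placeholder name
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"
        # OnError retries nodes that end in the Error phase (system-level problems
        # such as a pod that never started), not just application failures.
        retryPolicy: OnError
      container:
        image: alpine:latest
        command: [echo, hello]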
