What happened:
I have a large workflow to run, represented as a DAG with a large number of mutually independent nodes that run in parallel (probably not the best representation). Each task in the DAG can take anywhere from a few minutes to a few hours to complete, and there are between 200 and 600 such tasks scheduled in the workflow.
Whenever I run the workflow (tried both as a Workflow and CronWorkflow), all of the tasks complete, some of them after failing and retrying. However, there is always at least one task that ends up looking like this (see graphs(133:133)(0)):
The first pod to attempt the task appears to be stuck in ContainerCreating, while the second attempt completes successfully. Whenever a pod manifests this behaviour, the workflow remains in the Running state, even though the parent task graphs(133:133) is marked as completed.
According to Kubernetes, the pod no longer exists, and a workflow in this state remains in the Running status forever.
I can verify externally that the parent task graphs(133:133) has indeed completed, as it has populated a bucket in GCS with the output for that task.
See the Anything else we need to know section for additional info.
What you expected to happen:
Argo should set the message of graphs(133:133)(0) to pod termination or similar, and the workflow should complete and be labeled as Succeeded.
How to reproduce it (as minimally and precisely as possible):
This is the workflow that manifests the issue for me (it is not directly runnable, as I had to remove sensitive info):
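Because of the redaction, the manifest below is only a minimal, hypothetical sketch of the overall shape described above (one DAG task fanned out over a few hundred mutually independent items, each with a retryStrategy). The names, image, items, and parameters are placeholders, not values from the real workflow:

```yaml
# Hypothetical sketch only -- names, image, and items are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: graphs-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: graphs
            template: process-graph
            arguments:
              parameters:
                - name: graph-id
                  value: "{{item}}"
            # fanned out over a few hundred mutually independent items
            withItems: [0, 1, 2]   # ... 200-600 items in the real workflow
    - name: process-graph
      inputs:
        parameters:
          - name: graph-id
      retryStrategy:
        limit: 3                   # failed attempts are retried, as described above
      container:
        image: gcr.io/example/graph-processor:latest   # placeholder image
        command: [process-graph]
        args: ["{{inputs.parameters.graph-id}}"]
```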
Anything else we need to know?:
The workflow runs on a GKE cluster, on a node pool managed by a Cluster Autoscaler.
The node pool consists of preemptible machines. However, the workflow normally handles a node shutdown correctly: the pod is labeled with ⚠, the message is set to pod deleted, and the next attempt succeeds.
Going through the Stackdriver logs for both the cluster and the workflow-controller, I reconstructed this timeline:
[10:16:00] Pod is created and started successfully
[10:46:39] Cluster Autoscaler decides to kill the node on which the pod is running
[10:46:41] Workflow controller sets the phase of the pod from Running to Failed and the message to pod termination.
[10:46:44] Pod is labeled as completed
[10:46:55] New pod is created and successfully transitions its phase to Pending
[10:47:00] New pod transitions its phase to Running
[10:47:15] The old pod's phase is updated from Failed -> Pending, and the message is set to ContainerCreating
[11:12:14] New pod succeeds and the parent task's phase is also set to Succeeded
As you can see, the pod that previously failed and was replaced is, for some reason, marked as Pending again, a state from which it never recovers. I believe that is why the workflow never succeeds.
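To illustrate, the stuck node entry in the Workflow's status looks roughly like the sketch below. This is only a reconstruction from the timeline above, assuming the usual status.nodes layout; the node IDs and the retry node's name are placeholders, not copied from the real object:

```yaml
# Illustrative reconstruction only -- node IDs and names are placeholders.
status:
  phase: Running                  # the workflow as a whole never leaves Running
  nodes:
    my-workflow-1234567890:       # first (preempted) attempt
      displayName: graphs(133:133)(0)
      type: Pod
      phase: Pending              # flipped from Failed back to Pending
      message: ContainerCreating  # never recovers from this state
    my-workflow-0987654321:       # retried attempt
      displayName: graphs(133:133)(1)
      type: Pod
      phase: Succeeded
```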
Something that is invaluable when debugging similar errors is the full Workflow object that causes this error after it finishes running (or in this case stops running any further).
You can get it by running kubectl get wf <NAME> -o yaml. If it contains any sensitive company data and you're not comfortable sharing it publicly, you can share it with me privately on the project Slack at @simon.
Environment:
Running on GKE.
Workflow controller/Argo server/Argoexec are all using the v2.8.0-rc3 image, but I have tried with v2.7.3 and v2.8.0 as well and the problem remains.
Kubernetes version:
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.