What happened:
I have a large workflow to run, represented as a DAG with a large number of mutually independent nodes that run in parallel (probably not the best representation). Each task in the DAG can take anywhere from a few minutes to a few hours to complete, and there are between 200 and 600 such tasks scheduled in the workflow.
Whenever I run the workflow (tried both as a Workflow and CronWorkflow), all of the tasks complete, some of them after failing and retrying. However, there is always at least one task that ends up looking like this (see graphs(133:133)(0)):
The first pod to attempt the task appears to be stuck in ContainerCreating, while the second attempt completes successfully. Whenever a pod manifests this behaviour, the workflow remains in the Running state, even though the parent task graphs(133:133) is marked as completed.
According to Kubernetes, the pod no longer exists, and a workflow in this state remains in the Running status forever.
I can verify externally that the parent task graphs(133:133) has indeed completed, as it has populated a bucket in GCS with the output for that task.
See the Anything else we need to know section for additional info.
What you expected to happen:
Argo should set the message of graphs(133:133)(0) to pod termination or similar, and the workflow should complete and be labeled as Succeeded.
How to reproduce it (as minimally and precisely as possible):
This is the workflow that manifests the issue for me (it is not directly runnable, as I had to remove sensitive info):
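Because of the redaction, the manifest below is only a minimal, hypothetical sketch of the overall shape described above (one DAG task fanned out over a few hundred mutually independent items, each with a retryStrategy). The names, image, items, and parameters are placeholders, not values from the real workflow:

```yaml
# Hypothetical sketch only -- names, image, and items are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: graphs-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: graphs
            template: process-graph
            arguments:
              parameters:
                - name: graph-id
                  value: "{{item}}"
            # fanned out over a few hundred mutually independent items
            withItems: [0, 1, 2]   # ... 200-600 items in the real workflow
    - name: process-graph
      inputs:
        parameters:
          - name: graph-id
      retryStrategy:
        limit: 3                   # failed attempts are retried, as described above
      container:
        image: gcr.io/example/graph-processor:latest   # placeholder image
        command: [process-graph]
        args: ["{{inputs.parameters.graph-id}}"]
```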
Anything else we need to know?:
The workflow runs on a GKE cluster, on a node pool managed by a Cluster Autoscaler.
The node pool consists of preemptible machines. However, the workflow normally handles a node shutdown correctly: the pod is labeled with ⚠, the message is set to pod deleted, and the next attempt succeeds.
Going through the Stackdriver logs for both the cluster and the workflow-controller, I reconstructed this timeline:
[10:16:00] Pod is created and started successfully
[10:46:39] Cluster Autoscaler decides to kill the node on which the pod is running
[10:46:41] Workflow controller sets the phase of the pod from Running to Failed and the message to pod termination.
[10:46:44] Pod is labeled as completed
[10:46:55] New pod is created and successfully transitions its phase to Pending
[10:47:00] New pod transitions its phase to Running
[10:47:15] The old pod's phase is updated from Failed -> Pending, and the message is set to ContainerCreating
[11:12:14] New pod succeeds and the parent task's phase is also set to Succeeded
As you can see, the pod that previously failed and was replaced is, for some reason, marked as Pending again, a state from which it never recovers. I believe that is why the workflow never succeeds.
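To illustrate, the stuck node entry in the Workflow's status looks roughly like the sketch below. This is only a reconstruction from the timeline above, assuming the usual status.nodes layout; the node IDs and the retry node's name are placeholders, not copied from the real object:

```yaml
# Illustrative reconstruction only -- node IDs and names are placeholders.
status:
  phase: Running                  # the workflow as a whole never leaves Running
  nodes:
    my-workflow-1234567890:       # first (preempted) attempt
      displayName: graphs(133:133)(0)
      type: Pod
      phase: Pending              # flipped from Failed back to Pending
      message: ContainerCreating  # never recovers from this state
    my-workflow-0987654321:       # retried attempt
      displayName: graphs(133:133)(1)
      type: Pod
      phase: Succeeded
```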
Something that is invaluable when debugging similar errors is the full Workflow object that causes this error after it finishes running (or in this case stops running any further).
You can get it by running kubectl get wf <NAME> -o yaml. If it contains any sensitive company data and you're not comfortable sharing it publicly, you can share it with me privately on the project Slack at @simon.
Environment:
Running on GKE.
Workflow controller/Argo server/Argoexec are all using the v2.8.0-rc3 image, but I have tried with v2.7.3 and v2.8.0 as well and the problem remains.
Kubernetes version:
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.