Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pod gets stuck forever in pending state if it was deleted before initialisation #2711

Closed
4 tasks done
vmanucharyan opened this issue Apr 16, 2020 · 2 comments · Fixed by #2721
Closed
4 tasks done
Assignees
Labels

Comments

@vmanucharyan
Copy link
Contributor

vmanucharyan commented Apr 16, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:

In some cases kubernetes deletes pods before or right after it was initialized and argo misses that event - resulting in node stuck forever in "Pending" state. In our case it happens often under the load. According to our logs kube-scheduler deletes some pods shortly after they have been instantiated.

What you expected to happen:

Node changes phase to "Error" or "Failed".

How to reproduce it (as minimally and precisely as possible):

  1. submit this workflow:
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-
    spec:
      entrypoint: whalesay
      templates:
        - name: whalesay
          container:
            image: nonexistent
            command: [cowsay]
            args: ["hello world"]
  2. Manually delete the pod hello-world-xxxxx
  3. Check that node is still in pending state

Anything else we need to know?:

Before this PR - #2385
this case was detected and pod's state changed to "Error"

Environment:

  • Argo version:
argo: v2.7.1
  BuildDate: 2020-04-07T16:40:24Z
  GitCommit: 2a3f59c10ae260a460b6ad97a0cadd8667d4b488
  GitTreeState: clean
  GitTag: v2.7.1
  GoVersion: go1.13.4
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version :
clientVersion:
  buildDate: "2019-08-19T12:36:28Z"
  compiler: gc
  gitCommit: 2d3c76f9091b6bec110a5e63777c332469e0cba2
  gitTreeState: clean
  gitVersion: v1.15.3
  goVersion: go1.12.9
  major: "1"
  minor: "15"
  platform: darwin/amd64
serverVersion:
  buildDate: "2020-03-24T18:54:48Z"
  compiler: gc
  gitCommit: 71631703d37fb1b4b333abd427deae839aeb2032
  gitTreeState: clean
  gitVersion: v1.15.11-gke.3
  goVersion: go1.12.17b4
  major: "1"
  minor: 15+
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
➜  test-data git:(feature/td-45) argo get hello-world-76n8f
Name:                hello-world-76n8f
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Thu Apr 16 15:06:04 +0300 (26 seconds ago)
Started:             Thu Apr 16 15:06:04 +0300 (26 seconds ago)
Duration:            26 seconds

STEP                  TEMPLATE  PODNAME            DURATION  MESSAGE
 ◷ hello-world-76n8f  whalesay  hello-world-76n8f  26s       ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: pull access denied for nonexistent, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  • executor logs:
kubectl logs <failedpodname> -c init
kubectl logs <failedpodname> -c wait
  • workflow-controller logs:
time="2020-04-16T12:02:20Z" level=info msg="config map" name=argo-workflow-controller-configmap
time="2020-04-16T12:02:20Z" level=info msg="Configuration:\nartifactRepository:\n  s3:\n    accessKeySecret:\n      key: access_key\n      name: argo-minio-credentials\n    bucket: argo-artifacts\n    endpoint: argo-minio:9000\n    insecure: true\n    secretKeySecret:\n      key: secret_key\n      name: argo-minio-credentials\ncontainerRuntimeExecutor: docker\nfeatureFlags: {}\nmetricsConfig:\n  disableLegacy: false\npersistence:\n  archive: true\n  connectionPool: null\n  postgresql:\n    database: argo\n    host: postgres\n    passwordSecret:\n      key: password\n      name: argo-postgres\n    port: \"5432\"\n    tableName: argo_workflows\n    userNameSecret:\n      key: username\n      name: argo-postgres\npodSpecLogStrategy: {}\ntelemetryConfig:\n  disableLegacy: false\n"
time="2020-04-16T12:02:20Z" level=info msg="Persistence configuration enabled"
time="2020-04-16T12:02:20Z" level=info msg="Creating DB session"
time="2020-04-16T12:02:20Z" level=info msg="Persistence Session created successfully"
time="2020-04-16T12:02:20Z" level=info msg="Migrating database schema" clusterName=default dbType=postgres
time="2020-04-16T12:02:20Z" level=info msg="Node status offloading is disabled"
time="2020-04-16T12:02:20Z" level=info msg="Workflow archiving is enabled"
time="2020-04-16T12:02:20Z" level=info msg="Starting CronWorkflow controller"
time="2020-04-16T12:02:20Z" level=info msg="Starting workflow TTL controller (resync 20m0s)"
time="2020-04-16T12:02:20Z" level=info msg="Starting Workflow Controller" version=vv2.7.1+2a3f59c.dirty
time="2020-04-16T12:02:20Z" level=info msg="Workers: workflow: 8, pod: 8"
time="2020-04-16T12:02:20Z" level=info msg="Performing periodic GC every 5m0s"
time="2020-04-16T12:02:20Z" level=info msg="Started workflow TTL worker"
time="2020-04-16T12:06:04Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:04Z" level=info msg="Updated phase  -> Running" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:04Z" level=info msg="Pod node {hello-world-76n8f hello-world-76n8f hello-world-76n8f Pod whalesay nil    Pending   2020-04-16 12:06:04.9990897 +0000 UTC 0001-01-01 00:00:00 +0000 UTC   <nil> nil nil [] []} initialized Pending" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:05Z" level=info msg="Created pod: hello-world-76n8f (hello-world-76n8f)" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:05Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=180991 workflow=hello-world-76n8f
time="2020-04-16T12:06:06Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:06Z" level=info msg="Updating node &NodeStatus{ID:hello-world-76n8f,Name:hello-world-76n8f,DisplayName:hello-world-76n8f,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Pending,BoundaryID:,Message:,StartedAt:2020-04-16 12:06:04 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: ContainerCreating"
time="2020-04-16T12:06:06Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=180996 workflow=hello-world-76n8f
time="2020-04-16T12:06:07Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:10Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:10Z" level=info msg="Updating node &NodeStatus{ID:hello-world-76n8f,Name:hello-world-76n8f,DisplayName:hello-world-76n8f,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Pending,BoundaryID:,Message:ContainerCreating,StartedAt:2020-04-16 12:06:04 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: pull access denied for nonexistent, repository does not exist or may require 'docker login': denied: requested access to the resource is denied"
time="2020-04-16T12:06:10Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=181007 workflow=hello-world-76n8f
time="2020-04-16T12:06:11Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:23Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:23Z" level=info msg="Updating node &NodeStatus{ID:hello-world-76n8f,Name:hello-world-76n8f,DisplayName:hello-world-76n8f,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Pending,BoundaryID:,Message:ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: pull access denied for nonexistent, repository does not exist or may require 'docker login': denied: requested access to the resource is denied,StartedAt:2020-04-16 12:06:04 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: ImagePullBackOff: Back-off pulling image \"nonexistent\""
time="2020-04-16T12:06:23Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=181030 workflow=hello-world-76n8f
time="2020-04-16T12:06:24Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:26Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:26Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:26Z" level=info msg="Updating node &NodeStatus{ID:hello-world-76n8f,Name:hello-world-76n8f,DisplayName:hello-world-76n8f,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Pending,BoundaryID:,Message:ImagePullBackOff: Back-off pulling image \"nonexistent\",StartedAt:2020-04-16 12:06:04 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: pull access denied for nonexistent, repository does not exist or may require 'docker login': denied: requested access to the resource is denied"
time="2020-04-16T12:06:26Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=181041 workflow=hello-world-76n8f
time="2020-04-16T12:06:27Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f
time="2020-04-16T12:06:28Z" level=info msg="Processing workflow" namespace=default workflow=hello-world-76n8f

Logs

argo get <workflowname>
kubectl logs <failedpodname> -c init
kubectl logs <failedpodname> -c wait
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@vmanucharyan
Copy link
Contributor Author

vmanucharyan commented Apr 16, 2020

Looks like my issue can be fixed by enabling resubmitPendingPods, but this still looks like a bug. Can you take a look at my PR that i closed and tell if it's the right approach? Maybe we should check if this option is disabled and move pod to Error state in that case, what do you think?

@salanki
Copy link
Contributor

salanki commented Apr 17, 2020

I just had a Pod that was deleted due to node failure. Usually, Argo fails the step with pod termination and retries. In this case, it got stuck in:

Name:                md-npt-secyyhyvomdxdcy
ServiceAccount:      default
Status:              Running
Created:             Fri Apr 17 11:18:53 -0400 (2 hours ago)
Started:             Fri Apr 17 11:18:53 -0400 (2 hours ago)
Duration:            2 hours 10 minutes

STEP                                                            PODNAME                            DURATION  MESSAGE
 ● md-npt-secyyhyvomdxdcy (md-npt-secyyhyvomdxdcy)
 └---◷ md-npt-secyyhyvomdxdcy-0(0) (rscloud-main/rscloud-main)  md-npt-secyyhyvomdxdcy-1113725271  1h        ContainerCreating

The pod md-npt-secyyhyvomdxdcy-1113725271 does not actually exist anymore. Do you think it can be related to the same issue, or is this a different race condition of sorts?

Normal behavior:

Name:                long-running-job-quwflcokhagrcww
ServiceAccount:      default
Status:              Running
Created:             Fri Apr 17 13:26:08 -0400 (6 minutes ago)
Started:             Fri Apr 17 13:26:07 -0400 (6 minutes ago)
Duration:            6 minutes 34 seconds

STEP                                                                        PODNAME                                      DURATION  MESSAGE
 ● long-running-job-quwflcokhagrcww (long-running-job-quwflcokhagrcww)
 └---● long-running-job-quwflcokhagrcww-0 (rscloud-main/rscloud-main)
     ├-✖ long-running-job-quwflcokhagrcww-0(0) (rscloud-main/rscloud-main)  long-running-job-quwflcokhagrcww-3152886675  4m        pod termination
     └-● long-running-job-quwflcokhagrcww-0(1) (rscloud-main/rscloud-main)  long-running-job-quwflcokhagrcww-3757028054  2m

Both are from 2.7.2, so it seems to be timing / race related.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants