stuck with deleted pods #3857

Closed
fabio-rigato opened this issue Aug 25, 2020 · 16 comments · Fixed by #4064

@fabio-rigato (Contributor)

Hi guys,
Argo version 2.9.5
I have already shared the `argo get` YAML file with @simster7 privately. I'm facing a similar issue to #3665.
In particular, here is a screenshot of the Argo UI:

[Screenshot: Argo UI, 2020-08-25]

I am also unable to find the following error in my logs:
Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io "workflow-XXXXXXXXX": the object has been modified; please apply your changes to the latest version and try again Conflict

Thanks for your time and effort in fixing that.
Cheers,
Fabio
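(For context, the error quoted above is Kubernetes' standard optimistic-concurrency Conflict: an update was attempted against a stale resourceVersion. Below is a minimal Go sketch of how a client typically resolves it with client-go's retry.RetryOnConflict. It is not code from the Argo controller; the updateWorkflow helper and its parameters are illustrative assumptions.)

```go
// Minimal sketch only; not taken from the Argo codebase.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

// GroupVersionResource of the Workflow CRD named in the error message.
var workflowGVR = schema.GroupVersionResource{
	Group:    "argoproj.io",
	Version:  "v1alpha1",
	Resource: "workflows",
}

// updateWorkflow is a hypothetical helper: it re-reads the latest copy of the
// Workflow and re-applies mutate() until the update succeeds, so a Conflict
// ("the object has been modified") simply triggers another attempt.
func updateWorkflow(ctx context.Context, client dynamic.Interface, namespace, name string, mutate func(*unstructured.Unstructured)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wf, err := client.Resource(workflowGVR).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(wf) // apply the change on top of the current resourceVersion
		_, err = client.Resource(workflowGVR).Namespace(namespace).Update(ctx, wf, metav1.UpdateOptions{})
		return err // a Conflict here makes RetryOnConflict try again
	})
}
```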

@alexec (Contributor) commented Aug 25, 2020

Simon is out on vacation this week. Are you able to get us up to speed by providing more information, please? E.g. what status did you expect the workflow to end up in? Any logs you can attach?

@fabio-rigato (Contributor, Author)

Hey @alexec,
I shared the Argo workflow's output YAML file and a screenshot of the terminal showing the issue with you on Slack.
Please let me know if you need more info.
Cheers,
Fabio

@fabio-rigato (Contributor, Author)

Hey @simster7 and @alexec,

I hope you are well.

I'm facing the same issue, but this time the pod has been stuck in ContainerCreating for 8 hours.
I shared the "live" YAML file with you guys on Slack.

I'll attach a screenshot of the UI here, just for tracking:

[Screenshot: Argo UI, 2020-09-02]

Please let me know if you need more info or details.
Thanks for your time.

Cheers,
Fabio

@fabio-rigato (Contributor, Author)

Hey guys, just an update.
I found 4 more stuck Workflows; I shared the logs with @alexec and @simster7 on Slack.
Cheers,
Fabio

@fabio-rigato (Contributor, Author)

Hey @alexec and @simster7, I shared another stuck WF with you (on Slack).

This is a screenshot of the UI, just to keep track:

[Screenshot: Argo UI, 2020-09-07]

Cheers,
Fabio

@alexec (Contributor) commented Sep 8, 2020

@fabio-rigato can you confirm whether the pod actually ran? If so, can you provide the controller logs from around the time it finished?

@simster7 (Member) commented Sep 8, 2020

This seems to be an issue with how pods are scheduled and run in your environment, as opposed to an issue with Argo itself, I think. The controller logs should help diagnose this.

@fabio-rigato (Contributor, Author)

Hey @alexec, @simster7,
thanks for your reply.
I checked our logs, and the pod was definitely running and then disappeared without the Argo workflow step status being updated. I shared the controller logs covering the pod's execution time with you on Slack.
@simster7 I have just created a fresh k8s cluster with only Argo and a couple of our services installed in different namespaces, so I am pretty sure the issue we are facing is not caused by our environment. But I am very happy to have a call with you, share my screen, and show you our cluster.
Please let me know how I can help you fix this bug.
Cheers,
Fabio

@alexec (Contributor) commented Sep 14, 2020

Maybe related #4011

@alexec (Contributor) commented Sep 15, 2020

Maybe related #3097 and #3469

@alexec (Contributor) commented Sep 15, 2020

Maybe related #3862

@alexec (Contributor) commented Sep 15, 2020

@sarabala1979 I've added you, as you pointed out there is probably a problem with the pod deletion handler simply ignoring events (because labels are lost). Discussed with @jessesuen.

I've done a POC removing the podQueue and directly queuing pods. It passed all tests and worked well with a 1000-node workflow. I'd lean toward this.

Alex

NOTE! This may not fix this bug!
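(To illustrate the idea being discussed: below is a rough Go sketch of a pod informer DeleteFunc that unwraps the deletion tombstone and enqueues the owning workflow directly on the workflow queue, instead of going through a separate podQueue. This is not the actual Argo controller code; the function name, queue wiring, and key format are assumptions, while workflows.argoproj.io/workflow is the label Argo applies to workflow pods.)

```go
// Rough sketch only; queue wiring and function name are assumptions.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Label Argo applies to pods it creates for a workflow.
const workflowLabel = "workflows.argoproj.io/workflow"

// registerPodDeleteHandler enqueues the owning workflow whenever one of its
// pods is deleted, so the controller reconciles the workflow's node status
// even though the pod object is already gone.
func registerPodDeleteHandler(podInformer cache.SharedIndexInformer, wfQueue workqueue.RateLimitingInterface) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// A missed delete event arrives as a tombstone; unwrap it so
				// the pod's labels are not lost.
				tombstone, isTombstone := obj.(cache.DeletedFinalStateUnknown)
				if !isTombstone {
					return
				}
				pod, ok = tombstone.Obj.(*corev1.Pod)
				if !ok {
					return
				}
			}
			wfName, found := pod.Labels[workflowLabel]
			if !found {
				return // not a pod owned by a workflow
			}
			// Queue the workflow itself (namespace/name key), not the pod.
			wfQueue.Add(pod.Namespace + "/" + wfName)
		},
	})
}
```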

@alexec (Contributor) commented Sep 15, 2020

@sarabala1979 did you want to own the podQueue change?

@sarabala1979 (Member)

Sure, I can do it. So we are removing the podQueue and inlining the logic. @alexec @jessesuen please confirm my understanding is right.

@alexec (Contributor) commented Sep 16, 2020

@sarabala1979 @jessesuen I'd like to back-port any change to v2.11, but I think removing the podQueue is too risky to backport. How about:

  1. PR #1 - add the workflow to the wfQueue on pod deletion - back-port to v2.11
  2. PR #2 - remove the podQueue - v2.12 (and we don't have to do this if we don't want to)

Everyone is happy.

@fabio-rigato (Contributor, Author)

Hey guys,
thank you for working on that.
Please let me know when a stable release that fixes this bug will be available.
Hope you have a great day.
Cheers,
Fabio
