Default Job Idle Timeout not working #11484

Open
3 tasks done
felipe4334 opened this issue Dec 20, 2021 · 17 comments

Comments

@felipe4334

felipe4334 commented Dec 20, 2021

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

The Default Job Idle Timeout setting in the AWX job settings is not working properly.

PR #10906 does not seem to work properly when the value is set above 5-6 minutes.

AWX version

AWX 19.5.0

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

Firefox, Chrome

Steps to reproduce

@oweel it looks like idle_timeout might not be working too well.

Even though idle_timeout is set to 0, it's still timing out in 5-6 minutes.

- pause:
    minutes: 20

If idle_timeout is set to something like 10 seconds then it works.

And if idle_timeout is set to something like 9999, it will still time out in 5-6 minutes.

Expected results

No timeout.

Actual results

The job times out after 5-6 minutes.

Additional information

No response

@felipe4334
Author

Any help?

@felipe4334
Author

@jakemcdermott @oweel have you guys seen my comment on PR #10906?

@spireob

spireob commented Jan 12, 2022

I have the same problem, but in my case the job is killed after exactly 60 minutes if idle. I have some long-running jobs, and this causes problems because they do not finish properly.

@shanemcd
Member

If you look at /api/v2/jobs/5210, is there anything in the job_explanation field?
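For anyone who wants to check this field from a script rather than the browser, here is a minimal sketch. It assumes the job endpoint returns the usual `status`, `job_explanation`, `started`, and `finished` fields and that an OAuth2 token is accepted as a Bearer header; the base URL, token, and the helper names `summarize_job` and `fetch_job` are placeholders for illustration, not part of AWX.

```python
import json
import urllib.request
from datetime import datetime


def summarize_job(job: dict) -> dict:
    """Extract the fields that help tell an AWX-enforced timeout
    (non-empty job_explanation) from an external kill (empty one)."""
    started = job.get("started")
    finished = job.get("finished")
    runtime = None
    if started and finished:
        # Assumption: timestamps are ISO 8601 with a trailing "Z".
        t0 = datetime.fromisoformat(started.replace("Z", "+00:00"))
        t1 = datetime.fromisoformat(finished.replace("Z", "+00:00"))
        runtime = (t1 - t0).total_seconds()
    return {
        "status": job.get("status"),
        "job_explanation": job.get("job_explanation") or "",
        "runtime_seconds": runtime,
    }


def fetch_job(base_url: str, job_id: int, token: str) -> dict:
    """Fetch one job record. base_url and token are placeholders."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/jobs/{job_id}/",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (hypothetical host and token):
# print(summarize_job(fetch_job("https://awx.example.com", 5210, "TOKEN")))
```

An empty `job_explanation` together with a runtime close to 5-6 minutes would suggest the job was ended by something other than the idle timeout setting, though that is an inference rather than something the API states directly.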

@felipe4334
Author

@shanemcd No, it's empty.

@felipe4334
Author

@shanemcd Were you able to take a look at this?

@felipe4334
Author

@shanemcd Any update on this would be highly appreciated.

@felipe4334
Author

@shanemcd How can I help fix this issue?

@felipe4334
Author

@shanemcd please can you revisit this?

@yangzhang-ibm-au

I'm facing the same issue in AWX 19.5.1.

@TinLe

TinLe commented May 18, 2022

FYI, I am also seeing this in 19.3.0. The timeout is not working at all.

Let me add more detail. I have the system-wide timeout set to 300 (5 minutes), and that works. But setting the timeout in the template does not work. The template timeout is completely ignored, no matter what.

If I change the system-wide timeout to 0, the default, the timeout settings in the templates are still ignored.

@felipe4334
Author

@shanemcd Do you have any insight into why this issue is not fixed? I updated to 0.21.0 and am still getting the same issue, even with the latest EE images.

@AlanCoding
Member

I have not been able to reproduce this. Running a playbook that does command: sleep 700 with the global idle timeout set to 600, the job times out; without that setting, the job succeeds. Testing with shorter time frames, I can't confirm any bug with the pause module either.

@felipe4334
Author

@AlanCoding What are you running AWX on? The bug I experience is that the Kubernetes pod running the job gets terminated after 5 minutes of idle time.

@AlanCoding
Member

This issue is "Default Job Idle Timeout not working"; getting terminated by OCP after 5 minutes of idleness is kind of a different thing. I put up #12289 to adjust the help text, and I see that mixed messages might be confusing the issue.

I kicked off a job to see if I can reproduce.

@AlanCoding
Member

Right now I don't seem to be able to reproduce this, as container group jobs can sleep for 10 minutes or so without problem. Nothing in the pod spec jumps out to me as related either.

@williamoliveira

I have the same problem, but for me it times out after 25 minutes.
