Prevent task manager timeout by limiting number of jobs to start #8074
Conversation
Build succeeded.
Thanks for the detailed explanation @fosterseth
This seems pretty reasonable to me. I'd love for @rebeccahhh and @chrismeyersfsu - who have more task manager experience than I do - to weigh in as well.
awx/main/scheduler/task_manager.py
Outdated
# the task manager after 5 minutes. At scale, the task manager can easily take more than
# 5 minutes to start pending jobs. If this limit is reached, pending jobs
# will no longer be started and will be started on the next task manager cycle.
self.start_task_limit = 150
Curious, why 150, and not - say - 100?
No particular reason -- the original issue involved around 300 jobs locking up the task manager, so I figured half of that would be a safe number.
I could see a benefit to keeping the definition of this constant in settings.
i'll look into adding this as a setting
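For illustration, a minimal sketch of what that might look like, assuming a setting named START_TASK_LIMIT; the name, default value, and exact wiring are placeholders, not necessarily what the PR ends up with:

# awx/settings/defaults.py (hypothetical setting name and default)
# Maximum number of pending jobs the task manager may start in a single run
# before deferring the remainder to the next cycle.
START_TASK_LIMIT = 150

# awx/main/scheduler/task_manager.py
from django.conf import settings

class TaskManager:
    def __init__(self):
        # Read the per-run limit from settings instead of hard-coding it,
        # so operators can tune it for their deployment size.
        self.start_task_limit = settings.START_TASK_LIMIT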
@ryanpetrello We will trigger a few perf tests
awx/main/scheduler/task_manager.py
Outdated
@@ -451,6 +461,8 @@ def process_pending_tasks(self, pending_tasks):
            if self.is_job_blocked(task):
                logger.debug("{} is blocked from running".format(task.log_format))
                continue
            if self.start_task_limit <= 0:
                continue
wouldn't it be better to put this before self.is_job_blocked? Or even, if you hit the condition, break instead of continue
yes you're right, can break and be before is_job_blocked
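A rough sketch of the reordered loop being suggested here (simplified, not the exact diff from this PR):

# awx/main/scheduler/task_manager.py (sketch of the suggested ordering)
def process_pending_tasks(self, pending_tasks):
    for task in pending_tasks:
        # Check the per-run budget first and break out of the loop entirely:
        # once the limit is hit, no remaining task can be started this cycle,
        # so there is no point paying for is_job_blocked() on each of them.
        if self.start_task_limit <= 0:
            break
        if self.is_job_blocked(task):
            logger.debug("{} is blocked from running".format(task.log_format))
            continue
        # (capacity checks and starting the task follow in the real method)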
I only left minor non-blocking comments; after your testing is complete, I'd say go ahead with the merge.
Build succeeded.
This is a great small patch! It's worth mentioning the root cause here as well as possible solutions.
Root Cause
Possible Solutions
Below is a summary of the logic implemented in each class definition of preferred_instance_groups
@chrismeyersfsu your second solution sounds like a straightforward and really nice improvement. I think I'll implement that in a separate PR. Thanks for laying out the details.
Force-pushed from 4cce8ed to aef4f37 (Compare)
Force-pushed from aef4f37 to e09274e (Compare)
Build succeeded.
Build succeeded (gate pipeline).
SUMMARY
related #7655
There are two expensive code loops in the task manager:
process_pending_tasks()
the find preferred instance loop

If you queue up 500+ jobs (with concurrent jobs enabled) and have a lot of instances and organizations, it's very possible that the task manager will time out (it is reaped after running for 5 minutes). After a timeout, it loses its progress and the next cycle starts again from scratch. This creates a scenario where no jobs can be started.
The change here limits the number of jobs that the task manager can start per run to 150. Once this limit is reached, it no longer turns pending jobs into running jobs. Instead, it immediately schedules another task manager run to fire once the current one finishes, so that the next set of up to 150 jobs can be started.
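Roughly, that mechanism could look like the sketch below; schedule_task_manager() stands in for whatever helper actually queues the next run, and the signature and placement are assumptions on my part rather than the exact diff:

# Sketch: decrement the per-run budget each time a job is started, and when
# it runs out, ask for another task manager run right after this one finishes.
def start_task(self, task, instance_group, dependent_tasks=None, instance=None):
    self.start_task_limit -= 1
    if self.start_task_limit == 0:
        # Budget exhausted: the remaining pending jobs will be picked up by
        # the run scheduled here instead of waiting for the next periodic cycle.
        schedule_task_manager()
    # (existing logic that transitions the job to running follows)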
There is still a potential scenario where the task manager will time out. If we are under the limit (150) but at max capacity across available instances, we can still hit that heavy find preferred instance loop and run into a timeout. Maybe this isn't so bad, given we wouldn't be able to start jobs anyway, and even if it happens, jobs will still start eventually.

To explain the above, say you have jobA and jobB. jobA can only run on instanceA, and jobB can only run on instanceB.
Queue up 5k of jobA, followed by 5k of jobB.
Say instanceA is at max capacity, while instanceB has plenty of capacity left.
The next task manager run will loop through the jobA tasks first, then the jobB tasks (in order of job creation).
The task manager spends all of its time in find preferred instance for the jobA tasks and eventually times out. It never even gets to jobB, which technically could be started because instanceB has capacity left and we are not at the start_task_limit.

I think the above scenario is unlikely in most situations, and even in that case we don't hit a deadlock; jobs will eventually complete, albeit much more slowly than they should.
ISSUE TYPE
COMPONENT NAME
AWX VERSION