
Prevent task manager timeout by limiting number of jobs to start #8074

Conversation

@fosterseth (Member) commented Sep 3, 2020

SUMMARY

related #7655

There are two expensive code loops in the task manager:

  • Finding the preferred instance to run a job on (in process_pending_tasks())
  • Starting a task (db saves take a while)

If you queue up 500+ jobs (with concurrency enabled) and have many instances and organizations, it's very possible that the task manager will time out (it is reaped after running for 5 minutes). After a timeout it loses all progress, and the next cycle starts again from scratch. This creates a scenario where no jobs can be started at all.

The change here limits the number of jobs the task manager can start to 150. Once this limit is reached, it no longer turns pending jobs into running jobs; instead, it immediately schedules another task manager run to kick off once the current one finishes, so the next batch of jobs can be started.
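
Roughly, the limiting behavior looks like the minimal sketch below (names like schedule_task_manager() are stand-ins that follow this description, not the exact merged code):

```python
def schedule_task_manager():
    # Stand-in for the hook that kicks off another task manager run.
    pass


class TaskManager:
    def __init__(self):
        # Budget of jobs a single task manager run may start.
        self.start_task_limit = 150

    def start_task(self, task):
        self.start_task_limit -= 1
        if self.start_task_limit == 0:
            # Limit reached while starting this task: immediately schedule
            # another run for when the current one finishes, so the next
            # batch of pending jobs can be started without losing progress.
            schedule_task_manager()
        # ... the expensive db saves and dispatch of the task follow ...
```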

There is still a potential scenario where the task manager times out. If we are under the limit (150) but at max capacity across all available instances, we can still spend a long time in the heavy find-preferred-instance loop and hit the timeout. Maybe this isn't so bad, given we wouldn't be able to start jobs anyway, and even if it happens, jobs will still start eventually.

To explain the above, say you have jobA and jobB. jobA can only run on instanceA, and jobB can only run on instanceB.

  • Queue up 5k of jobA, followed by 5k of jobB.

  • Say instanceA is at max capacity, while instanceB has plenty of capacity left.

  • The next task manager run will loop through jobA first, then jobB (order of job creation).

  • The task manager spends all of its time in find preferred instance for jobA and eventually times out. It never even gets to jobB, which technically could be started because instanceB has capacity left and we are not at the start_task_limit.

I think the above scenario is unlikely in most situations, and even in that case we don't hit a deadlock; jobs will eventually complete, albeit much more slowly than they should.

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 14.1.0

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@ryanpetrello (Contributor) left a comment

Thanks for the detailed explanation @fosterseth

This seems pretty reasonable to me. I'd love for @rebeccahhh and @chrismeyersfsu - who have more task manager experience than I do - to weigh in as well.

@ryanpetrello (Contributor)

@moonrail @psuriset either of you have any interest in giving this patch a whirl? (see: #7655)

# the task manager after 5 minutes. At scale, the task manager can easily take more than
# 5 minutes to start pending jobs. If this limit is reached, pending jobs
# will no longer be started and will be started on the next task manager cycle.
self.start_task_limit = 150
Contributor

Curious, why 150, and not - say - 100?

Member Author

No particular reason -- the original issue involved around 300 jobs locking up the task manager, so I figured half of that would be a safe number.

Member

I could see a benefit to keeping the definition of this constant in settings.

Member Author

I'll look into adding this as a setting.
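
For illustration, something like the sketch below (the setting name START_TASK_LIMIT is hypothetical, just to show the shape of the change):

```python
from django.conf import settings


class TaskManager:
    def __init__(self):
        # Read the limit from settings, falling back to the current default.
        # START_TASK_LIMIT is a hypothetical setting name for illustration.
        self.start_task_limit = getattr(settings, 'START_TASK_LIMIT', 150)
```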

@psuriset commented Sep 4, 2020

@ryanpetrello We will trigger a few perf tests.

@@ -451,6 +461,8 @@ def process_pending_tasks(self, pending_tasks):
if self.is_job_blocked(task):
logger.debug("{} is blocked from running".format(task.log_format))
continue
if self.start_task_limit <= 0:
continue
Member

Wouldn't it be better to put this check before self.is_job_blocked?

Or even better: if you hit the condition, break instead of continue.

Member Author

Yes, you're right; it can break and go before is_job_blocked.
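
Roughly, the reordered check would look like this (a simplified fragment of the process_pending_tasks() loop, not the final diff):

```python
for task in pending_tasks:
    if self.start_task_limit <= 0:
        # Out of budget for this run; no point evaluating the remaining
        # pending tasks, so stop the loop entirely.
        break
    if self.is_job_blocked(task):
        logger.debug("{} is blocked from running".format(task.log_format))
        continue
    # ... instance selection and start_task() follow ...
```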

@AlanCoding (Member) left a comment

I only left minor, non-blocking comments; after your testing is complete, I'd say go ahead with the merge.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@chrismeyersfsu (Member)

This is a great small patch!

It's worth mentioning the root cause here as well as possible solutions.

Root Cause

preferred_instance_groups gets called for each job considered in the process_pending_tasks() loop that @fosterseth mentioned. This flow results in MANY individual queries.

Possible Solutions

  1. Leverage prefetch_related.
  2. instance_groups is a field on UnifiedJobTemplate, which is good: we could query all instance_groups for all unique job ids that we care about and use the results as a lookup table for the task manager cycle (see the sketch after the table below).

Below is a summary of the logic implemented in each class's definition of preferred_instance_groups. After looking at the logic in each implementation, we know that solution 2 is possible.

preferred_instance_groups

| Model | Description |
| --- | --- |
| Job | job_template.instance_groups + inventory.instance_groups + organization.instance_groups |
| WorkflowJob | [] |
| Ad Hoc | inventory.instance_groups + inventory.organization.instance_groups |
| ProjectUpdate | super.preferred_instance_groups + organization.instance_groups |
| UnifiedJob | unified_job_template.instance_groups |
| InventoryUpdate | inventory_source.inventory.instance_groups + inventory_source.inventory.organization.instance_groups |
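
To make solution 2 a bit more concrete, here is a rough sketch of the bulk-query lookup table (the helper name build_instance_group_lookup and the import path are assumptions for illustration, not code from this PR):

```python
from collections import defaultdict

from awx.main.models import UnifiedJobTemplate  # import path assumed


def build_instance_group_lookup(pending_tasks):
    """One bulk query -> {unified_job_template_id: [instance_group_id, ...]},
    reusable for the whole task manager cycle instead of calling
    preferred_instance_groups() once per job."""
    template_ids = {t.unified_job_template_id for t in pending_tasks}
    lookup = defaultdict(list)
    # values_list across the M2M yields one (template_id, group_id) row per
    # related instance group; group_id is None for templates with no groups.
    rows = UnifiedJobTemplate.objects.filter(id__in=template_ids).values_list(
        'id', 'instance_groups'
    )
    for template_id, group_id in rows:
        if group_id is not None:
            lookup[template_id].append(group_id)
    return lookup
```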

@fosterseth (Member Author)

@chrismeyersfsu your second solution sounds like a straightforward and really nice improvement. I think I'll implement that in a separate PR. Thanks for laying out the details.

@fosterseth force-pushed the fix-7655_task_manager_times_out branch from 4cce8ed to aef4f37 on September 8, 2020 16:14
@fosterseth force-pushed the fix-7655_task_manager_times_out branch from aef4f37 to e09274e on September 8, 2020 16:16
@softwarefactory-project-zuul (Contributor)

Build succeeded.

@softwarefactory-project-zuul (Contributor)

Build succeeded (gate pipeline).

@softwarefactory-project-zuul (bot) merged commit 90c3bfc into ansible:devel on Sep 8, 2020
AlanCoding pushed a commit to AlanCoding/awx that referenced this pull request Sep 10, 2020