
Prevent task manager timeout by limiting number of jobs to start #8074

Conversation

@fosterseth (Member) commented Sep 3, 2020

SUMMARY

related #7655

There are two expensive code loops in the task manager:

  • Finding the preferred instance to run a job on (in process_pending_tasks())
  • Starting a task (db saves take a while)

If you queue up 500+ jobs (with concurrency enabled) and have many instances and organizations, it's very possible that the task manager will time out (it is reaped after running for 5 minutes). After a timeout it loses all progress, and the next cycle starts again from scratch. This creates a scenario where no jobs can be started at all.

The change here limits the number of jobs the task manager can start to 150. Once this limit is reached, it no longer turns pending jobs into running jobs; instead, it immediately schedules another task manager run to kick off once the current one finishes, so the next batch of jobs can be started.
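
Roughly, the limiting behavior looks like the minimal sketch below (names like schedule_task_manager() are stand-ins that follow this description, not the exact merged code):

```python
def schedule_task_manager():
    # Stand-in for the hook that kicks off another task manager run.
    pass


class TaskManager:
    def __init__(self):
        # Budget of jobs a single task manager run may start.
        self.start_task_limit = 150

    def start_task(self, task):
        self.start_task_limit -= 1
        if self.start_task_limit == 0:
            # Limit reached while starting this task: immediately schedule
            # another run for when the current one finishes, so the next
            # batch of pending jobs can be started without losing progress.
            schedule_task_manager()
        # ... the expensive db saves and dispatch of the task follow ...
```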

There is still a potential scenario where the task manager times out. If we are under the limit (150) but at max capacity across all available instances, we can still spend a long time in the heavy find-preferred-instance loop and hit the timeout. Maybe this isn't so bad, given we wouldn't be able to start jobs anyway, and even if it happens, jobs will still start eventually.

To explain the above, say you have jobA and jobB. jobA can only run on instanceA, and jobB can only run on instanceB.

  • Queue up 5k of jobA, followed by 5k of jobB.

  • Say instanceA is at max capacity, while instanceB has plenty of capacity left.

  • The next task manager run will loop through jobA first, then jobB (order of job creation).

  • The task manager spends all of its time in find preferred instance for jobA and eventually times out. It never even gets to jobB, which technically could be started because instanceB has capacity left and we are not at the start_task_limit.

I think the above scenario is unlikely in most situations, and even in that case we don't hit a deadlock; jobs will eventually complete, albeit much more slowly than they should.

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 14.1.0

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@ryanpetrello (Contributor) left a comment

Thanks for the detailed explanation @fosterseth

This seems pretty reasonable to me. I'd love for @rebeccahhh and @chrismeyersfsu - who have more task manager experience than I do - to weigh in as well.

@ryanpetrello (Contributor)

@moonrail @psuriset either of you have any interest in giving this patch a whirl? (see: #7655)

# the task manager after 5 minutes. At scale, the task manager can easily take more than
# 5 minutes to start pending jobs. If this limit is reached, pending jobs
# will no longer be started and will be started on the next task manager cycle.
self.start_task_limit = 150
Contributor

Curious, why 150, and not - say - 100?

Member Author

No particular reason -- the original issue involved around 300 jobs locking up the task manager, so I figured half of that would be a safe number.

Member

I could see a benefit to keeping the definition of this constant in settings.

Member Author

I'll look into adding this as a setting.
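
For illustration, something like the sketch below (the setting name START_TASK_LIMIT is hypothetical, just to show the shape of the change):

```python
from django.conf import settings


class TaskManager:
    def __init__(self):
        # Read the limit from settings, falling back to the current default.
        # START_TASK_LIMIT is a hypothetical setting name for illustration.
        self.start_task_limit = getattr(settings, 'START_TASK_LIMIT', 150)
```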

@psuriset commented Sep 4, 2020

@ryanpetrello We will trigger a few perf tests.

@@ -451,6 +461,8 @@ def process_pending_tasks(self, pending_tasks):
if self.is_job_blocked(task):
logger.debug("{} is blocked from running".format(task.log_format))
continue
if self.start_task_limit <= 0:
continue
Member

Wouldn't it be better to put this check before self.is_job_blocked?

Or even better: if you hit the condition, break instead of continue.

Member Author

Yes, you're right; it can break and go before is_job_blocked.
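
Roughly, the reordered check would look like this (a simplified fragment of the process_pending_tasks() loop, not the final diff):

```python
for task in pending_tasks:
    if self.start_task_limit <= 0:
        # Out of budget for this run; no point evaluating the remaining
        # pending tasks, so stop the loop entirely.
        break
    if self.is_job_blocked(task):
        logger.debug("{} is blocked from running".format(task.log_format))
        continue
    # ... instance selection and start_task() follow ...
```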

@AlanCoding (Member) left a comment

I only left minor, non-blocking comments; after your testing is complete, I'd say go ahead with the merge.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@chrismeyersfsu (Member)

This is a great small patch!

It's worth mentioning the root cause here as well as possible solutions.

Root Cause

preferred_instance_groups gets called for each job considered in the process_pending_tasks() loop that @fosterseth mentioned. This flow results in MANY individual queries.

Possible Solutions

  1. Leverage prefetch_related.
  2. instance_groups is a field on UnifiedJobTemplate, which is good: we could query all instance_groups for all unique job ids that we care about and use the results as a lookup table for the task manager cycle (see the sketch after the table below).

Below is a summary of the logic implemented in each class's definition of preferred_instance_groups. After looking at the logic in each implementation, we know that solution 2 is possible.

preferred_instance_groups

| Model | Description |
| --- | --- |
| Job | job_template.instance_groups + inventory.instance_groups + organization.instance_groups |
| WorkflowJob | [] |
| Ad Hoc | inventory.instance_groups + inventory.organization.instance_groups |
| ProjectUpdate | super.preferred_instance_groups + organization.instance_groups |
| UnifiedJob | unified_job_template.instance_groups |
| InventoryUpdate | inventory_source.inventory.instance_groups + inventory_source.inventory.organization.instance_groups |
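
To make solution 2 a bit more concrete, here is a rough sketch of the bulk-query lookup table (the helper name build_instance_group_lookup and the import path are assumptions for illustration, not code from this PR):

```python
from collections import defaultdict

from awx.main.models import UnifiedJobTemplate  # import path assumed


def build_instance_group_lookup(pending_tasks):
    """One bulk query -> {unified_job_template_id: [instance_group_id, ...]},
    reusable for the whole task manager cycle instead of calling
    preferred_instance_groups() once per job."""
    template_ids = {t.unified_job_template_id for t in pending_tasks}
    lookup = defaultdict(list)
    # values_list across the M2M yields one (template_id, group_id) row per
    # related instance group; group_id is None for templates with no groups.
    rows = UnifiedJobTemplate.objects.filter(id__in=template_ids).values_list(
        'id', 'instance_groups'
    )
    for template_id, group_id in rows:
        if group_id is not None:
            lookup[template_id].append(group_id)
    return lookup
```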

@fosterseth (Member Author)

@chrismeyersfsu your second solution sounds like a straightforward and really nice improvement. I think I'll implement that in a separate PR. Thanks for laying out the details.

@fosterseth force-pushed the fix-7655_task_manager_times_out branch from 4cce8ed to aef4f37 on September 8, 2020 16:14
@fosterseth force-pushed the fix-7655_task_manager_times_out branch from aef4f37 to e09274e on September 8, 2020 16:16
@softwarefactory-project-zuul (Contributor)

Build succeeded.

@softwarefactory-project-zuul (Contributor)

Build succeeded (gate pipeline).

@softwarefactory-project-zuul (bot) merged commit 90c3bfc into ansible:devel on Sep 8, 2020
AlanCoding pushed a commit to AlanCoding/awx that referenced this pull request Sep 10, 2020