
reduce per-job database query count #8333

Closed

Conversation

@chrismeyersfsu (Member) commented Oct 6, 2020

  • Do not query the database, for each job, for the set of Instances that belong to the group we are trying to fit the job into.
  • Instead, cache the set of instances per instance group.

Before these changes, the task manager would take 1-2 seconds to decide and begin running a job. With these changes, it takes about 0.05 seconds per job.

| Name | 200 jobs | 1,000 jobs |
| --- | --- | --- |
| Fit optimization | 11s (0.055 s/job) | 58s (0.058 s/job) |
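
For reference, a minimal sketch of the caching idea, assuming Django-ORM-style InstanceGroup objects (a `name` field and an `instances` related manager). The `graph` layout mirrors the `self.graph[rampart_group.name]['instances']` access that appears in the diff below, but the class itself is illustrative, not the actual AWX implementation:

```python
class InstanceCache:
    """Illustrative sketch: build the per-group instance list once per
    scheduling run instead of issuing a queryset for every pending job."""

    def __init__(self, instance_groups):
        # One query per instance group, executed up front.
        self.graph = {
            group.name: {'instances': list(group.instances.filter(enabled=True))}
            for group in instance_groups
        }

    def instances_for(self, rampart_group):
        # The per-job lookup is now a dict access, not a database query.
        return self.graph[rampart_group.name]['instances']
```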

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@fosterseth (Member)

looks good, tested the patch and can confirm the speed boost

```python
if execution_instance:
    execution_instance.capacity -= task.task_impact
```

Inline review comment (Member):

could we be double subtracting here if idle_instance_that_fits == execution_instance?
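
A hypothetical standalone illustration of that concern (not the actual task-manager code): if both names end up bound to the same cached Instance object, the task's impact is subtracted from its capacity twice.

```python
class FakeInstance:
    def __init__(self, capacity):
        self.capacity = capacity

task_impact = 10
instance = FakeInstance(capacity=100)

idle_instance_that_fits = instance
execution_instance = instance              # same cached object

if idle_instance_that_fits:
    idle_instance_that_fits.capacity -= task_impact
if execution_instance:
    execution_instance.capacity -= task_impact

print(instance.capacity)                   # 80, not 90: double subtraction
```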

@fosterseth (Member) commented Oct 6, 2020

we should also make sure the changes here are compatible with this code that we call on running tasks, as it looks to modify your self.graph before we process any pending tasks

`self.graph = InstanceGroup.objects.capacity_values(tasks=tasks, graph=self.graph)`

`def capacity_values(self, qs=None, tasks=None, breakdown=False, graph=None):`

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@moonrail commented Oct 7, 2020

@chrismeyersfsu
Thank you for mentioning it in #7655

This is game changing. I can confirm much faster processing times for starting jobs; your stated 20x speedup easily holds.

With START_TASK_LIMIT 10, one run starting 10 jobs takes < 2 seconds, an improvement of more than 6 seconds.

With START_TASK_LIMIT 100, one run starting 100 jobs takes < 7 seconds, an improvement of roughly 20-60 seconds. I have also not seen the task manager get killed anymore, since it no longer runs longer than 5 minutes.

Still, I have found two potential issues:

  1. The task manager now runs continuously (which is good), but at this new lightning-fast pace it also produces a lot of log output. Maybe a problem arises with external logging when DEBUG level is required because of this? I currently have no environment to test this.
  2. This is probably not directly an issue with this PR, but since you have added some sort of instance cache, it could be (see the sketch after this list):
     From my observations, the task manager is started on a node and runs there for some number of iterations, until it gives up the lock and another node takes it over.
     I have noticed that the task manager does not pick up "newly" enabled instances after its start on a specific node.

     To reproduce, I set all instances to disabled, started 1k jobs and watched the task manager logs. It ran in a loop.
     I then enabled all 15 instances, but no jobs were started in further task manager iterations, as there was "0 capacity remaining".
     As soon as the task manager switched nodes, it detected the available instances.
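
A hypothetical sketch of the staleness described in point 2 (the names and cache shape are illustrative, not the AWX code): a per-group cache built once when the task manager starts on a node cannot see instances enabled afterwards until it is rebuilt, for example after the lock moves to another node.

```python
enabled_in_db = set()   # stand-in for Instance rows with enabled=True

def build_cache():
    # Built once per task-manager tenure on a node in this illustration.
    return {'tower': {'instances': sorted(enabled_in_db)}}

graph = build_cache()                     # all instances still disabled
print(graph['tower']['instances'])        # [] -> "0 capacity remaining"

enabled_in_db.update({'node1', 'node2'})  # instances enabled via the API
print(graph['tower']['instances'])        # still [] until build_cache() runs again
```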

```
@@ -45,17 +46,46 @@
class TaskManager():

    def __init__(self):
        '''
        Do NOT put database queries or other potentially expensive operations
```

Inline comment (Contributor):

It's times like these that I employ one of my favorite comments:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1213

This may just be my excuse to use emojis in source code, though.

@ryanpetrello (Contributor) left a comment

I like this PR a lot 👍

```python
                logger.debug("Starting {} in group {} instance {} (remaining_capacity={})".format(
                             task.log_format, rampart_group.name, execution_instance.hostname, remaining_capacity))
            elif not execution_instance and idle_instance_that_fits:
                execution_instance = InstanceGroup.fit_task_to_most_remaining_capacity_instance(task, self.graph[rampart_group.name]['instances']) or \
```
@ryanpetrello (Contributor) commented Oct 12, 2020

Conspiracy theory time, come along and join me @chrismeyersfsu @fosterseth

  1. The task manager starts, obtains the lock, calls after_lock_init, and has a set of known instances. We arrive at a moment in time right before this code.
  2. The user removes an instance via the API, or via awx-manage deprovision_instance, and it's the one we happen to pick here via execution_instance.
  3. What terrible, terrible thing happens next?

@fosterseth (Member) commented Oct 12, 2020

hmm good point. Not sure of the behavior, but this method seems to go along with this scenario

`def reap_jobs_from_orphaned_instances(self):`

but that might be more for jobs after .save(). I think in the case you describe, task.save() should fail

(Contributor)

I think this is very unlikely to happen, but we should probably test to make sure that it doesn't result in something bad, like a task never running and getting stuck in waiting forever (or worse, a task-manager-breaking exception).

@fosterseth (Member) commented Oct 12, 2020

okay I tried this on a local 3-node cluster install. If I delete the execution instance the job is assigned to before the job is saved in the start_task method, it still runs successfully on that instance.

[screenshot: job detail page]

Says it fails in the explanation but actually succeeds (I see the hello world playbook run output).

I get the following errors in the log, on a loop:

```
tools_awx_1_1 | 2020-10-12 18:47:16,633 ERROR    awx.main.dispatch encountered an error communicating with redis to store dispatcher statistics
tools_awx_1_1 | Traceback (most recent call last):
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 111, in record_statistics
tools_awx_1_1 |     self.redis.set(f'awx_{self.name}_statistics', self.pool.debug())
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/pool.py", line 341, in debug
tools_awx_1_1 |     self.cleanup()
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/pool.py", line 436, in cleanup
tools_awx_1_1 |     reaper.reap(excluded_uuids=running_uuids)
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/reaper.py", line 38, in reap
tools_awx_1_1 |     (changed, me) = Instance.objects.get_or_register()
tools_awx_1_1 |   File "/awx_devel/awx/main/managers.py", line 150, in get_or_register
tools_awx_1_1 |     return (False, self.me())
tools_awx_1_1 |   File "/awx_devel/awx/main/managers.py", line 108, in me
tools_awx_1_1 |     raise RuntimeError("No instance found with the current cluster host id")
tools_awx_1_1 | RuntimeError: No instance found with the current cluster host id
...
tools_awx_1_1 | 2020-10-12 18:47:13,473 ERROR    awx.main.wsbroadcast AWX is currently installing/upgrading.  Trying again in 5s...
```
(Contributor)

💥

```diff
         instance_most_capacity = None
-        for i in self.instances.filter(capacity__gt=0, enabled=True).order_by('hostname'):
+        for i in instances:
```
@ryanpetrello (Contributor) commented Oct 12, 2020

@chrismeyersfsu @fosterseth to address my race condition comment below, perhaps we should sprinkle in some sort of refresh_from_db call with a try...catch to make absolutely sure the selected instance actually still exists (and wasn't deleted in the time it took the task manager to start and enter this code block)?
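
A minimal sketch of that suggestion, using Django's Model.refresh_from_db (which raises the model's DoesNotExist if the row has been deleted). The helper name and the exact fields refreshed are assumptions, and where it would be called in the task manager is left open:

```python
from django.core.exceptions import ObjectDoesNotExist

def instance_still_exists(execution_instance):
    """Hypothetical helper: re-check that the cached Instance row is still
    present before committing a task to it."""
    try:
        # Raises DoesNotExist if the instance was removed, e.g. via
        # `awx-manage deprovision_instance`, after the cache was built.
        execution_instance.refresh_from_db(fields=['capacity', 'enabled'])
    except ObjectDoesNotExist:
        return False
    return True
```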

@ryanpetrello (Contributor)

replaced by #8403
