reduce per-job database query count #8333
Conversation
Build failed.
Force-pushed f3f6d2f to ecf2b3f
Build failed.
Force-pushed ecf2b3f to 0a32d0a
Build succeeded.
looks good, tested the patch and can confirm the speed boost
awx/main/scheduler/task_manager.py (outdated)

```
if execution_instance:
    execution_instance.capacity -= task.task_impact
```
could we be double subtracting here if `idle_instance_that_fits == execution_instance`?
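A minimal sketch of the aliasing concern, using an invented `FakeInstance` class rather than AWX code: if both branches subtract `task_impact` and the two names point at the same object, that instance's capacity is charged twice.

```python
class FakeInstance:
    def __init__(self, capacity):
        self.capacity = capacity

task_impact = 10
execution_instance = FakeInstance(capacity=100)
idle_instance_that_fits = execution_instance  # the aliasing case in question

# Mirrors the two subtraction sites from the diff.
if execution_instance:
    execution_instance.capacity -= task_impact
if idle_instance_that_fits:
    idle_instance_that_fits.capacity -= task_impact

# Charged twice: 80 instead of the expected 90.
assert execution_instance.capacity == 80
```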
we should also make sure the changes here are compatible with this code that we call on running tasks, as it looks to modify the same data: awx/main/scheduler/task_manager.py, line 548 in 37b3cc7 (see also line 203 in 37b3cc7)
Force-pushed 0a32d0a to 58cba01
Build failed.
Force-pushed 58cba01 to bc0701a
Build failed.
* Do not query the database, for each job, for the set of Instances that belong to the group on which we are trying to fit the job.
* Instead, cache the set of instances per instance group.
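An illustrative sketch of that caching idea (not the PR's actual code; the dict and helper names are assumptions, and the filter arguments are taken from the diff below): build each group's instance list once per scheduling run and reuse it for every job.

```python
# Illustrative only: one query per instance group per scheduling run,
# instead of one query per job.
instances_by_group = {}

def instances_for_group(group):
    if group.name not in instances_by_group:
        instances_by_group[group.name] = list(
            group.instances.filter(capacity__gt=0, enabled=True).order_by('hostname')
        )
    return instances_by_group[group.name]
```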
Force-pushed bc0701a to 964ce64
Build succeeded.
@chrismeyersfsu This is game changing - I can confirm a ton faster processing time for starting jobs - your stated 20x figure is easily valid. With START_TASK_LIMIT=10, one run starting 10 jobs takes < 2 seconds, an improvement of > 6 seconds. With START_TASK_LIMIT=100, one run starting 100 jobs takes < 7 seconds, an improvement of roughly 20-60 seconds. I've also not seen the task manager get killed, since it no longer runs for more than 5 minutes. Still, I have found two potential issues:
```
@@ -45,17 +46,46 @@
class TaskManager():

    def __init__(self):
        '''
        Do NOT put database queries or other potentially expensive operations
```
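A sketch of the pattern that docstring implies, using the `after_lock_init` hook named elsewhere in this thread (illustrative, not the PR's exact code): keep `__init__` cheap and defer queries until the scheduling lock is held.

```python
from awx.main.models import InstanceGroup  # real AWX model; usage here is illustrative

class TaskManager:
    def __init__(self):
        # No queries here: __init__ may run even when another task
        # manager already holds the scheduling lock.
        self.graph = {}

    def after_lock_init(self):
        # Safe to hit the database now that the lock is held; one pass
        # builds the per-group instance cache the scheduler reads from.
        for group in InstanceGroup.objects.prefetch_related('instances'):
            self.graph[group.name] = {'instances': list(group.instances.all())}
```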
It's times like these that I employ one of my favorite comments:
https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1213
This may just be my excuse to use emojis in source code, though.
I like this PR a lot 👍
```
    logger.debug("Starting {} in group {} instance {} (remaining_capacity={})".format(
        task.log_format, rampart_group.name, execution_instance.hostname, remaining_capacity))
elif not execution_instance and idle_instance_that_fits:
    execution_instance = InstanceGroup.fit_task_to_most_remaining_capacity_instance(task, self.graph[rampart_group.name]['instances']) or \
```
Conspiracy theory time, come along and join me @chrismeyersfsu @fosterseth

- The task manager starts, obtains the lock, calls `after_lock_init`, and has a set of known instances. We arrive at a moment in time right before this code.
- The user removes an instance via the API, or via `awx-manage deprovision_instance`, and it's the one we happen to pick here via `execution_instance`.
- What terrible, terrible thing happens next?
hmm good point. Not sure of the behavior, but this method seems to go along with this scenario:

awx/main/scheduler/task_manager.py, line 533 in e6c1249:

```
def reap_jobs_from_orphaned_instances(self):
```

but that might be more for jobs after `.save()`. I think in the case you describe, `task.save()` should fail
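For context, a hedged sketch of what a reaper along those lines could look like (the real `reap_jobs_from_orphaned_instances` may differ; `Instance`, `UnifiedJob`, and the field names are drawn from AWX, but the body here is an assumption):

```python
from awx.main.models import Instance, UnifiedJob  # real models; logic below is assumed

def reap_jobs_from_orphaned_instances():
    # Fail any active job whose execution node no longer maps to a
    # registered Instance (e.g. it was deprovisioned mid-run).
    known_hostnames = set(Instance.objects.values_list('hostname', flat=True))
    for job in UnifiedJob.objects.filter(status__in=('running', 'waiting')):
        if job.execution_node and job.execution_node not in known_hostnames:
            job.status = 'failed'
            job.job_explanation = 'Execution instance disappeared.'
            job.save(update_fields=['status', 'job_explanation'])
```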
I think this is very unlikely to happen, but we should probably test to make sure that it doesn't result in something bad, like a task never running and getting stuck in `waiting` forever (or worse, a task-manager-breaking exception).
okay I tried this on a local 3-node cluster install. If I delete the execution instance the job is assigned to before the job is saved in the `start_task` method, it still runs successfully on that instance. The job explanation says it failed, but it actually succeeds (I see the hello world playbook run output).

I get the following errors in the log, on a loop:
```
tools_awx_1_1 | 2020-10-12 18:47:16,633 ERROR awx.main.dispatch encountered an error communicating with redis to store dispatcher statistics
tools_awx_1_1 | Traceback (most recent call last):
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 111, in record_statistics
tools_awx_1_1 |     self.redis.set(f'awx_{self.name}_statistics', self.pool.debug())
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/pool.py", line 341, in debug
tools_awx_1_1 |     self.cleanup()
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/pool.py", line 436, in cleanup
tools_awx_1_1 |     reaper.reap(excluded_uuids=running_uuids)
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/reaper.py", line 38, in reap
tools_awx_1_1 |     (changed, me) = Instance.objects.get_or_register()
tools_awx_1_1 |   File "/awx_devel/awx/main/managers.py", line 150, in get_or_register
tools_awx_1_1 |     return (False, self.me())
tools_awx_1_1 |   File "/awx_devel/awx/main/managers.py", line 108, in me
tools_awx_1_1 |     raise RuntimeError("No instance found with the current cluster host id")
tools_awx_1_1 | RuntimeError: No instance found with the current cluster host id
...
tools_awx_1_1 | 2020-10-12 18:47:13,473 ERROR awx.main.wsbroadcast AWX is currently installing/upgrading. Trying again in 5s...
```
💥
```
  instance_most_capacity = None
- for i in self.instances.filter(capacity__gt=0, enabled=True).order_by('hostname'):
+ for i in instances:
```
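A sketch of how that selection could read over the cached list (assumed shape; the real `InstanceGroup` method may differ in its details):

```python
def fit_task_to_most_remaining_capacity_instance(task, instances):
    # Scan the cached per-group instance list (no database query here)
    # and keep the instance with the most remaining capacity that can
    # still absorb the task's impact.
    instance_most_capacity = None
    for i in instances:
        if i.remaining_capacity >= task.task_impact and (
            instance_most_capacity is None
            or i.remaining_capacity > instance_most_capacity.remaining_capacity
        ):
            instance_most_capacity = i
    return instance_most_capacity
```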
@chrismeyersfsu @fosterseth to address my race condition comment below, perhaps we should sprinkle in some sort of `refresh_from_db` call with a `try...except` to make absolutely sure the selected instance actually still exists (and wasn't deleted in the time it took the task manager to start and enter this code block)?
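A minimal sketch of that guard, assuming a Django `Instance` model (illustrative, not the PR's code); `refresh_from_db` raises the model's `DoesNotExist`, a subclass of `ObjectDoesNotExist`, when the row is gone:

```python
from django.core.exceptions import ObjectDoesNotExist

def instance_still_exists(instance):
    # Re-read the chosen instance from the database right before
    # committing a task to it; returns False if it was deprovisioned
    # after the task manager built its cache.
    try:
        instance.refresh_from_db()
        return True
    except ObjectDoesNotExist:
        return False
```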
replaced by #8403
Before these changes, the task manager would take 1-2 seconds to decide and begin running a job. With these changes, it takes about 0.05 seconds per job.
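At those rates the arithmetic lines up: scheduling 100 jobs drops from roughly 100-200 seconds (1-2 s each) to about 5 seconds (0.05 s each), consistent with the roughly 20x figure reported above.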