
reduce per-job database query count #8333

Closed

Conversation

@chrismeyersfsu (Member) commented Oct 6, 2020

  • Do not query the database, for each job, for the set of Instances that belong to the group we are trying to fit the job into.
  • Instead, cache the set of instances per instance group.

Before these changes, the task manager would take 1-2 seconds to decide and begin running a job. With these changes, it takes about 0.05 seconds per job.

| Name | 200 jobs | 1,000 jobs |
| --- | --- | --- |
| Fit optimization | 11s (0.055 s/job) | 58s (0.058 s/job) |
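
For reference, a minimal sketch of the caching idea, assuming Django-ORM-style InstanceGroup objects (a `name` field and an `instances` related manager). The `graph` layout mirrors the `self.graph[rampart_group.name]['instances']` access that appears in the diff below, but the class itself is illustrative, not the actual AWX implementation:

```python
class InstanceCache:
    """Illustrative sketch: build the per-group instance list once per
    scheduling run instead of issuing a queryset for every pending job."""

    def __init__(self, instance_groups):
        # One query per instance group, executed up front.
        self.graph = {
            group.name: {'instances': list(group.instances.filter(enabled=True))}
            for group in instance_groups
        }

    def instances_for(self, rampart_group):
        # The per-job lookup is now a dict access, not a database query.
        return self.graph[rampart_group.name]['instances']
```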

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@fosterseth (Member)

looks good, tested the patch and can confirm the speed boost

```python
if execution_instance:
    execution_instance.capacity -= task.task_impact
```

Inline review comment (Member):

could we be double subtracting here if idle_instance_that_fits == execution_instance?
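
A hypothetical standalone illustration of that concern (not the actual task-manager code): if both names end up bound to the same cached Instance object, the task's impact is subtracted from its capacity twice.

```python
class FakeInstance:
    def __init__(self, capacity):
        self.capacity = capacity

task_impact = 10
instance = FakeInstance(capacity=100)

idle_instance_that_fits = instance
execution_instance = instance              # same cached object

if idle_instance_that_fits:
    idle_instance_that_fits.capacity -= task_impact
if execution_instance:
    execution_instance.capacity -= task_impact

print(instance.capacity)                   # 80, not 90: double subtraction
```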

@fosterseth (Member) commented Oct 6, 2020

we should also make sure the changes here are compatible with this code that we call on running tasks, as it looks to modify your self.graph before we process any pending tasks

`self.graph = InstanceGroup.objects.capacity_values(tasks=tasks, graph=self.graph)`

`def capacity_values(self, qs=None, tasks=None, breakdown=False, graph=None):`

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build failed.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@moonrail commented Oct 7, 2020

@chrismeyersfsu
Thank you for mentioning it in #7655

This is game changing. I can confirm much faster processing times for starting jobs; your stated 20x speedup easily holds.

With START_TASK_LIMIT 10, one run starting 10 jobs takes < 2 seconds, an improvement of more than 6 seconds.

With START_TASK_LIMIT 100, one run starting 100 jobs takes < 7 seconds, an improvement of roughly 20-60 seconds. I have also not seen the task manager get killed anymore, since it no longer runs longer than 5 minutes.

Still, I have found two potential issues:

  1. The task manager now runs continuously (which is good), but at this new lightning-fast pace it also produces a lot of log output. Maybe a problem arises with external logging when DEBUG level is required because of this? I currently have no environment to test this.
  2. This is probably not directly an issue with this PR, but since you have added some sort of instance cache, it could be (see the sketch after this list):
     From my observations, the task manager is started on a node and runs there for some number of iterations, until it gives up the lock and another node takes it over.
     I have noticed that the task manager does not pick up "newly" enabled instances after its start on a specific node.

     To reproduce, I set all instances to disabled, started 1k jobs and watched the task manager logs. It ran in a loop.
     I then enabled all 15 instances, but no jobs were started in further task manager iterations, as there was "0 capacity remaining".
     As soon as the task manager switched nodes, it detected the available instances.
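
A hypothetical sketch of the staleness described in point 2 (the names and cache shape are illustrative, not the AWX code): a per-group cache built once when the task manager starts on a node cannot see instances enabled afterwards until it is rebuilt, for example after the lock moves to another node.

```python
enabled_in_db = set()   # stand-in for Instance rows with enabled=True

def build_cache():
    # Built once per task-manager tenure on a node in this illustration.
    return {'tower': {'instances': sorted(enabled_in_db)}}

graph = build_cache()                     # all instances still disabled
print(graph['tower']['instances'])        # [] -> "0 capacity remaining"

enabled_in_db.update({'node1', 'node2'})  # instances enabled via the API
print(graph['tower']['instances'])        # still [] until build_cache() runs again
```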

```
@@ -45,17 +46,46 @@
class TaskManager():

    def __init__(self):
        '''
        Do NOT put database queries or other potentially expensive operations
```

Inline comment (Contributor):

It's times like these that I employ one of my favorite comments:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L1213

This may just be my excuse to use emojis in source code, though.

@ryanpetrello (Contributor) left a comment

I like this PR a lot 👍

```python
                logger.debug("Starting {} in group {} instance {} (remaining_capacity={})".format(
                             task.log_format, rampart_group.name, execution_instance.hostname, remaining_capacity))
            elif not execution_instance and idle_instance_that_fits:
                execution_instance = InstanceGroup.fit_task_to_most_remaining_capacity_instance(task, self.graph[rampart_group.name]['instances']) or \
```
@ryanpetrello (Contributor) commented Oct 12, 2020

Conspiracy theory time, come along and join me @chrismeyersfsu @fosterseth

  1. The task manager starts, obtains the lock, calls after_lock_init, and has a set of known instances. We arrive at a moment in time right before this code.
  2. The user removes an instance via the API, or via awx-manage deprovision_instance, and it's the one we happen to pick here via execution_instance.
  3. What terrible, terrible thing happens next?

@fosterseth (Member) commented Oct 12, 2020

hmm good point. Not sure of the behavior, but this method seems to go along with this scenario

`def reap_jobs_from_orphaned_instances(self):`

but that might be more for jobs after .save(). I think in the case you describe, task.save() should fail

(Contributor)

I think this is very unlikely to happen, but we should probably test to make sure that it doesn't result in something bad, like a task never running and getting stuck in waiting forever (or worse, a task-manager-breaking exception).

@fosterseth (Member) commented Oct 12, 2020

okay I tried this on a local 3-node cluster install. If I delete the execution instance the job is assigned to before the job is saved in the start_task method, it still runs successfully on that instance.

[screenshot: job detail page]

Says it fails in the explanation but actually succeeds (I see the hello world playbook run output).

I get the following errors in the log, on a loop:

```
tools_awx_1_1 | 2020-10-12 18:47:16,633 ERROR    awx.main.dispatch encountered an error communicating with redis to store dispatcher statistics
tools_awx_1_1 | Traceback (most recent call last):
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 111, in record_statistics
tools_awx_1_1 |     self.redis.set(f'awx_{self.name}_statistics', self.pool.debug())
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/pool.py", line 341, in debug
tools_awx_1_1 |     self.cleanup()
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/pool.py", line 436, in cleanup
tools_awx_1_1 |     reaper.reap(excluded_uuids=running_uuids)
tools_awx_1_1 |   File "/awx_devel/awx/main/dispatch/reaper.py", line 38, in reap
tools_awx_1_1 |     (changed, me) = Instance.objects.get_or_register()
tools_awx_1_1 |   File "/awx_devel/awx/main/managers.py", line 150, in get_or_register
tools_awx_1_1 |     return (False, self.me())
tools_awx_1_1 |   File "/awx_devel/awx/main/managers.py", line 108, in me
tools_awx_1_1 |     raise RuntimeError("No instance found with the current cluster host id")
tools_awx_1_1 | RuntimeError: No instance found with the current cluster host id
...
tools_awx_1_1 | 2020-10-12 18:47:13,473 ERROR    awx.main.wsbroadcast AWX is currently installing/upgrading.  Trying again in 5s...
```
(Contributor)

💥

```diff
         instance_most_capacity = None
-        for i in self.instances.filter(capacity__gt=0, enabled=True).order_by('hostname'):
+        for i in instances:
```
@ryanpetrello (Contributor) commented Oct 12, 2020

@chrismeyersfsu @fosterseth to address my race condition comment below, perhaps we should sprinkle in some sort of refresh_from_db call with a try...catch to make absolutely sure the selected instance actually still exists (and wasn't deleted in the time it took the task manager to start and enter this code block)?
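
A minimal sketch of that suggestion, using Django's Model.refresh_from_db (which raises the model's DoesNotExist if the row has been deleted). The helper name and the exact fields refreshed are assumptions, and where it would be called in the task manager is left open:

```python
from django.core.exceptions import ObjectDoesNotExist

def instance_still_exists(execution_instance):
    """Hypothetical helper: re-check that the cached Instance row is still
    present before committing a task to it."""
    try:
        # Raises DoesNotExist if the instance was removed, e.g. via
        # `awx-manage deprovision_instance`, after the cache was built.
        execution_instance.refresh_from_db(fields=['capacity', 'enabled'])
    except ObjectDoesNotExist:
        return False
    return True
```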

@ryanpetrello (Contributor)

replaced by #8403
