Scheduler/Dispatcher does not start Jobs with >600 pending Jobs #7655
Probably this also affects running jobs. In #7659 we've described how we spawn 10-30 jobs per minute. In the logs I can see that jobs are being processed less and less as the number of pending jobs increases. Our cluster has now reached 5900 pending jobs & 75 running jobs. Processes of ansible-playbook on the instances are still running (note the CPU time):
Edit 1:
Although sometimes they're at 300+ seconds, as initially described, and some queries now run for quite a long time:
The database server is mostly idle:
Connections are also fine:
Latency between nodes & db is as good as it gets:
Edit 2:
Strace of an ansible-playbook process on the first instance in the cluster:
So it looks like the awx-manage process opens a lot of transactions and executes just one "SELECT" in each.
Edit 3:
This could be caused by the number of parallel calls to sq_inq from running our test playbooks against the same hosts. |
@ryanpetrello please take a look at this. |
@psuriset of interest to scale team |
@fosterseth if you've got some available time next week, can you take a peek at this? |
@fosterseth I'm interested in this issue so let me know if you want help testing |
In your test setup, are the jobs you are launching all from the same job template? Do you experience the same issues if you have concurrent jobs disabled in the job template (i.e. Enable Concurrent Jobs set to False)? This would mean only 1 job would run per instance at a time. The scheduler is very optimized in this case, and I would expect it to behave well even with thousands of pending jobs. I am still looking into the other case -- jobs that can run concurrently. |
@fosterseth We did not test with non-concurrent job templates as in production we will have single job templates that may be run for a lot of organizations & inventories. Would test results for the following two tests help you?
2.:
|
I think we need to find the simplest test scenario that also replicates the issue.
example.sh
If the above goes okay, then swap the above JT with one you use and retest. If it does not work (jobs don't go from pending to running), then try it again, but this time with Enable Concurrent Jobs set to False. The above test will help me determine whether the problem is task manager related. |
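For reference, a rough Python sketch of the procedure above (this is not the attached example.sh; the URL, credentials, job template ID and job count are placeholders, and the endpoints used are the standard AWX v2 REST API):

```python
# Build up a large pending-job backlog: disable all instances, launch the
# same job template many times, then re-enable the instances and watch
# whether the task manager works through the backlog.
import requests

AWX_URL = "https://awx.example.com"   # placeholder
AUTH = ("admin", "password")          # placeholder credentials
JOB_TEMPLATE_ID = 42                  # the simple test JT
NUM_JOBS = 1000                       # size of the pending backlog

session = requests.Session()
session.auth = AUTH
session.verify = False                # lab setup only

# 1. Disable every instance so launched jobs stay in "pending".
instances = session.get(f"{AWX_URL}/api/v2/instances/").json()["results"]
for inst in instances:
    session.patch(f"{AWX_URL}/api/v2/instances/{inst['id']}/", json={"enabled": False})

# 2. Launch the job template until the desired backlog exists.
for _ in range(NUM_JOBS):
    session.post(f"{AWX_URL}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/")

# 3. Re-enable the instances; this is the point at which the scheduler
#    should start draining the backlog.
for inst in instances:
    session.patch(f"{AWX_URL}/api/v2/instances/{inst['id']}/", json={"enabled": True})
```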
I've followed your instructions, but used 15 instances instead of 10.
First test with concurrent jobs enabled
1001 pending jobs in instance group "tower" and all instances disabled:
The scheduler takes around 37 seconds to skip all 1001 pending jobs:
After enabling all instances via API (finished at 08:44:49), the node that was running the scheduler task did not detect available instances. The scheduler ran several times with the following output, until 08:47:55:
No jobs were started in this period. Following this, the scheduler was started on another node - this node then started scheduling correctly:
No jobs were started at this time; they were just assigned to instances. But this scheduler run never finished correctly - it was killed:
And the scheduler started from the beginning:
Second test with concurrent jobs disabled
Output is nearly the same: the scheduler does submit jobs, but is killed:
Why does the scheduler work sequentially? This seems to me like the wrong concept for clustered active/active environments. Could instances not take jobs themselves from a queue and process them at will - why does the scheduler have to control all processing? |
Any updates? We are still experiencing this with 14.0.0. |
@moonrail was just talking to @fosterseth about this... I think we should have an update early next week. |
@moonrail thanks for running the tests and the detailed response. A while back we had an issue with a CPython bug that was causing task manager hangs, so we put some code in place to reap the task manager after it runs for 5 minutes. In your case the task manager is not hanging; it's just that the loop for starting jobs is taking > 5 minutes. Our transactions are atomic, so reaping the task manager doesn't commit the transaction, which is why jobs aren't actually starting (they don't run) and the next task manager run starts over with the first job. Some possible solutions:
@ryanpetrello do you think it's safe to remove the task manager reaper that is in place? |
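To illustrate the interaction being described, here is a minimal sketch assuming Django's transaction.atomic semantics (this is not the real task manager code; assign_to_instance and submit_to_dispatcher are hypothetical stand-ins):

```python
from django.db import transaction

def assign_to_instance(job):
    """Hypothetical stand-in: pick an instance with free capacity."""

def submit_to_dispatcher(job):
    """Hypothetical stand-in: hand the job off for execution."""

def schedule_pending_jobs(pending_jobs):
    # The whole scheduling pass runs inside one atomic transaction.
    with transaction.atomic():
        for job in pending_jobs:
            assign_to_instance(job)
            job.status = "waiting"
            job.save(update_fields=["status"])
            submit_to_dispatcher(job)
    # The transaction commits only when the atomic block exits normally.
    # If the reaper kills the worker at the 5-minute mark, everything above
    # rolls back: no job actually starts, and the next task manager run
    # begins again with the very first pending job.
```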
I'm still a little nervous about removing that - if the task manager takes 5+ minutes to start jobs, it makes me wonder when/if it would ever recover. I do like the failsafe that it gives us to prevent deadlocks. |
One quick fix could be to limit the number of jobs the task manager can start on a given run - when it reaches that limit, it won't start any more. A quick implementation: if this limit is reached, we can immediately schedule another task manager run right after the current cycle ends; that way there isn't a delay (which is around 20 seconds or so) between runs. I'd say a limit of around 150 jobs seems low enough that even at scale (5k jobs) the task manager should get through all pending jobs within a couple of minutes. |
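Something along these lines, as a sketch of the idea (this is not the actual patch; the name START_TASK_LIMIT matches the setting discussed further down, and schedule_task_manager() is just a placeholder here):

```python
START_TASK_LIMIT = 150  # suggested cap on jobs started per task manager run

def schedule_task_manager():
    """Placeholder: would queue another task manager run right away."""

class TaskManager:
    def __init__(self):
        self.start_task_limit = START_TASK_LIMIT

    def start_task(self, task, instance):
        if self.start_task_limit <= 0:
            return                      # cap already reached in this run
        self.start_task_limit -= 1
        if self.start_task_limit == 0:
            # Don't wait ~20s for the next periodic run; queue another pass
            # as soon as the current cycle (and its transaction) finishes.
            schedule_task_manager()
        # ... existing job submission logic would follow here unchanged ...
```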
@fosterseth yea I've thought on it some, and I think I like this idea - could you turn it into a PR? |
@mcharanrm re-created this issue on a Tower 3.7.2 and then upgraded to a build of devel that included this patch. At the beginning of the timescale we see that no jobs were running on the 3.7.2 instance and that over 1k jobs were pending. The blank part is where we were running the upgrade. Then, after the upgrade, it started scheduling 100 jobs at a time and running them to completion. While something bad could happen that could still cause the task manager to time out while scheduling 100 jobs, if this were the case the user could adjust START_TASK_LIMIT to meet their needs. I'm going to say this is verified and will be released in the next release of AWX. |
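If that situation did come up, overriding the limit would presumably look like any other AWX settings override (the file path below is an assumption about a Tower-style install, not part of the patch):

```python
# e.g. /etc/tower/conf.d/task_manager.py (deployment-specific; adjust to
# however your installation injects extra settings)
START_TASK_LIMIT = 50   # cap the number of jobs started per task manager run
```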
Sorry for being late on testing. After testing in combination with But even with
Now I'm beginning to wonder if the default START_TASK_LIMIT should be lower. The example playbook is fairly short (30s + 7s overhead), so it should provide a solid baseline for job latency. Few ansible-playbooks will be shorter, but many will be longer and consume instance capacity over a longer time. If AWX nodes are being added/enabled, or if "big" capacity jobs finish after the Task Manager is started, it will not use their capacity in this run. So higher START_TASK_LIMIT values will also lead to a longer delay in using nodes' available capacity. Why not reduce the default value of START_TASK_LIMIT from 100 to a way more responsive 10? This should improve the user experience by quite a margin and make it look less like "AWX is frozen/deadlocked", as the time to wait until jobs are being executed/assigned is significantly lowered. From my understanding this should also make more use of available capacity and therefore scale a bit better. |
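To put rough numbers on that trade-off (a sketch only; the 1-2 s per-job start time is the figure quoted in the next comment, taken as ~1.5 s here):

```python
# Approximate time a full "start up to N jobs" cycle blocks the task manager
# before newly enabled/freed capacity is looked at again.
per_job_start_s = 1.5  # observed ~1-2 s per job before the fit-time fix

for limit in (10, 100):
    cycle_s = limit * per_job_start_s
    print(f"START_TASK_LIMIT={limit:>3}: ~{cycle_s:.0f}s per start cycle")
```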
If you get a chance @moonrail, please give this branch a try: https://github.com/ansible/awx/compare/devel...chrismeyersfsu:fix-tm_slow_fit?expand=1 This reduces the time to start an individual job from 1-2 seconds down to .05 seconds, so you should see a 20x+ speedup. |
ISSUE TYPE
SUMMARY
With a lot of pending jobs (e.g. if instances were disabled/down for maintenance, the cluster was restarted, a failover occurred, ...), the job scheduler/dispatcher task "run_task_manager" will not finish in under 5 minutes and is killed by: https://github.com/ansible/awx/blob/devel/awx/main/dispatch/pool.py#L385
From our observation, jobs will only be executed once the scheduler/dispatcher has run through all pending/new jobs and dispatched/skipped them.
Therefore no jobs will be executed, and "run_callback_receiver" will never be run.
We've tried it with the following setups:
ENVIRONMENT
STEPS TO REPRODUCE
First method:
Second method:
EXPECTED RESULTS
Either:
Or:
ACTUAL RESULTS
ADDITIONAL INFORMATION
Example log output of the Scheduler/Dispatcher being killed: