Database deadlock when running Facts job #8145
Comments
Can you attach as much of the full log as possible? How many slices, and how many nodes are in each slice?
Answer: 273 hosts per slice.

There are a few updates since we last reported on this issue to the community. We observed that one of the indexes on the main_host table, host_ansible_facts_default_gin, has grown significantly and could be contributing to the slowdown whenever a query is performed against main_host.

Currently, we have around 4,000 hosts in one of our inventories. During our collect-facts job execution, AWX sends one large update query covering all 4,000 inventory hosts, even when only a subset of them is actually affected. This activity takes hours and causes all subsequent queries to queue up. We have observed such an update query, produced by the collect-facts job, run for 18+ hours before it had to be manually killed.

We would like to ask the community whether updating 'last_job_id' and 'last_job_host_summary_id' for all hosts is the expected behavior, even when only a subset of hosts is targeted (e.g., running the collect-facts job against our 4,000-host inventory with a limit).

Because of the slowdown in database performance, possibly due to a bloated 'host_ansible_facts_default_gin' index, we modified the update_host_summary_from_stats method in events.py (under models) so that it only sends an update for hosts that were actually part of the run, rather than for all hosts in the inventory.

This is the original code (lines 523 to 529 in dc492d0):
This is the modified code (unchanged code blocks removed from the method's body):
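As a rough illustration of the change described above (the actual diff is not reproduced here), a minimal sketch follows. It is not the AWX source from dc492d0; the `summary_ids_by_host_id` mapping and the import path are assumptions:

```python
from awx.main.models import Host  # assumed import path

def update_host_summaries_for_run(job, hostnames, summary_ids_by_host_id):
    """Sketch: restrict the post-run update to hosts that took part in it.

    Not the actual AWX code; `summary_ids_by_host_id` (host pk ->
    JobHostSummary pk) is a hypothetical mapping assumed to be built
    earlier in the real method.
    """
    # Fetch only the hosts named in the run, not every host in the inventory.
    affected = list(job.inventory.hosts.filter(name__in=hostnames))
    for host in affected:
        host.last_job_id = job.id
        host.last_job_host_summary_id = summary_ids_by_host_id[host.pk]
    # One bounded UPDATE over the affected rows only, instead of a massive
    # UPDATE rewriting last_job_id/last_job_host_summary_id for all ~4,000
    # rows in main_host.
    Host.objects.bulk_update(affected, ['last_job_id', 'last_job_host_summary_id'])
```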
Hey @jpell958, thanks for digging into this. Can you check the formatting on your Python code diff above (including indentation)?
Did your problem go away with this patch? If you've got a diff that you think resolves this or improves performance, please feel free to open a pull request so we can talk it over!
The changes you are referring to are already accounted for in the version of AWX we are using. With the code change we made, the massive query to update 'last_job_id' and 'last_job_host_summary_id' is now faster.
@jpell958 👍 Are you interested in opening a pull request?
At one point we implemented the batch size and limited it to 500 hosts. However, we still observed that the operation took a long time and blocked other transactions, keeping the main_host table locked.

We will have a peak load in our environment in about 3 weeks and should be able to report back then on calling one host.save() at a time; without that load we cannot replicate this issue. We are working to simulate the load in our lab and will be happy to test your changes then.

One other point: we are observing an increase in our indexes of about 500 MB per day. We will continue to monitor whether this is related to the issue at hand. At one point the index reached 260 GB and we had to rebuild it.
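As an aside, the index growth described above can be tracked from a Django shell using standard PostgreSQL size functions. This is a hedged sketch, with the index name taken from earlier in this thread:

```python
from django.db import connection

# Report the on-disk size of the facts GIN index named in this thread.
# pg_relation_size / pg_size_pretty are standard PostgreSQL functions.
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT pg_size_pretty(pg_relation_size('host_ansible_facts_default_gin'))"
    )
    print('host_ansible_facts_default_gin size:', cursor.fetchone()[0])
```

Sampling this daily would confirm the roughly 500 MB/day growth and help decide when a rebuild is due, well before the index approaches the 260 GB point mentioned above.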
It might be that one solution here is to just roll back to distinct updates; this optimization may not be worth it given the behavior you're seeing. That said, in my testing with a 10k-host inventory, the ORM cost of distinct …
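For illustration, here is a minimal sketch of the two strategies weighed in this thread: a batched bulk update versus distinct per-host saves. `updated_hosts` is a hypothetical list of Host objects whose fields were already modified in memory, and the import path is an assumption:

```python
from django.db import transaction
from awx.main.models import Host  # assumed import path

FIELDS = ['last_job_id', 'last_job_host_summary_id']

# updated_hosts: hypothetical list of Host objects already modified in memory.

# Strategy 1: batched bulk update. batch_size caps each UPDATE at 500 rows,
# but Django wraps all batches in one atomic transaction, so row locks on
# main_host are held until the entire call commits.
Host.objects.bulk_update(updated_hosts, FIELDS, batch_size=500)

# Strategy 2: distinct per-host saves. Each UPDATE commits on its own, so no
# lock outlives a single-row update, at the cost of one round trip per host.
for host in updated_hosts:
    with transaction.atomic():
        host.save(update_fields=FIELDS)
```

The single enclosing transaction in bulk_update() may explain why batching alone did not stop the blocking reported earlier, while distinct saves trade shorter lock lifetimes for per-row ORM overhead.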
Tested this on a large cluster with 2xlarge instances and fact caching enabled; no deadlock was observed.
Thank you for actioning this request and for the testing efforts.
ISSUE TYPE
SUMMARY
An AWX database deadlock occurs when facts jobs (package facts and gather facts) are running in slices with 4,000 targets.
ENVIRONMENT
STEPS TO REPRODUCE
The issue happens every time we run a playbook with this specific set of tasks:
More details will be added to this case as they are gathered.
EXPECTED RESULTS
ACTUAL RESULTS
Example query used:
ADDITIONAL INFORMATION
Screenshot showing a job status of Successful while also displaying JOB IS STILL RUNNING:

Django dump: