Database deadlock when running Facts job #8145

Closed
jpell958 opened this issue Sep 15, 2020 · 11 comments

@jpell958

jpell958 commented Sep 15, 2020

ISSUE TYPE
  • Bug Report
SUMMARY

An AWX database deadlock occurs when facts jobs (package facts and gather facts) are running in slices with 4,000 targets.

ENVIRONMENT
  • AWX version: 13.0
  • AWX install method: developed Ansible playbook
  • Ansible version: 2.9.11
  • Operating System: RHEL 7
  • Web Browser: Chromium Edge
STEPS TO REPRODUCE

The issue happens every time we run a playbook with this specific set of tasks:

---
- name: Collect Package Facts
  package_facts:
    manager: "auto"
  tags:
    - gather-facts

- name: Collect All Facts
  setup:
    gather_subset: all
  tags:
    - gather-facts

More details will be added to this issue as they are gathered.

EXPECTED RESULTS
  • No deadlock when running facts job
  • Successfully receive EOF event.
ACTUAL RESULTS
  • Deadlock occurs when running facts job
  • No EOF event received.

Example query used:

UPDATE "main_host" SET "last_job_id" = (CASE WHEN
("main_host"."id" = 8750) THEN 252213 WHEN
("main_host"."id" = 8543) THEN 252212 WHEN
("main_host"."id" = 2245) THEN 252211 WHEN
("main_host"."id" = 2246) THEN 252210 WHEN
("main_host"."id" = 3554) THEN 252268 WHEN
("main_host"."id" = 3553) THEN 252208 WHEN
("main_host"."id" = 3541) THEN 252207 WHEN
("main_host"."id" = 3542)
THEN 252269 WHEN ("main_host"."id" = 428) THEN 252205
WHEN ("main_host"."id" = 651) THEN 252204 WHEN
("main_host"."id" = 10030) THEN 252203 WHEN
("main_host"."id" = 511) THEN 252202 WHEN ("main_host
"."id" = 429) THEN 252213 WHEN ("main_host"."id" = 512)
THEN 252212 WHEN ("main_host"."id" = 294) THEN 252211
WHEN ("main_host"."id" = 295) THEN 252210 WHEN
("main_host"."id" = 296) THEN 252268 WHEN
("main_host"."id" = 335) THEN 252208 WHEN
("main_host"."id" = 430) THEN 252207 WHEN
("main_host"."id" = 539) THEN 252269 WHEN
("main_host"."id" = 297) THEN 252205 WHEN
("main_host"."id" = 720) THEN
252204 WHEN ("main_host"."id" = 710) THEN 252203 WH |
29194 | UPDATE "main_host" SET "last_job_id" = (CASE
WHEN ("main_host"."id" = 8750) THEN 252213 WHEN
("main_host"."id" = 8543) THEN 25221
2 WHEN ("main_host"."id" = 2245) THEN 252211 WHEN
("main_host"."id" = 2246) THEN 252210 WHEN
("main_host"."id" = 3554) THEN 252268 WHEN
("main_host"."id" = 3553) THEN 252208 WHEN
("main_host"."id" =
3541) THEN 252207 WHEN ("main_host"."id" = 3542) THEN
248644 WHEN ("main_host"."id" = 428) THEN 252205 WHEN
("main_host"."id" = 651) THEN 252204 WHEN
("main_host"."id" = 10030) THEN 252203 WHEN ("m
ain_host"."id" = 511) THEN 252202 WHEN
("main_host"."id" = 429) THEN 252213
ADDITIONAL INFORMATION

Screenshot showing job status of Successful, but also showing JOB IS STILL RUNNING

Django traceback excerpt:

django.db.utils.OperationalError: deadlock detected
DETAIL:  Process 1412 waits for ShareLock on transaction 612485542; blocked by process 26582.
Process 26582 waits for ShareLock on transaction 612803038; blocked by process 1412.
HINT:  See server log for query details.
CONTEXT:  while deleting tuple (4244,46) in relation "main_host"
During handling of the above exception, another exception occurred:
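
For reference, a minimal sketch of the kind of query that can be used to see which backends are blocking each other while this is happening (standard pg_stat_activity / pg_blocking_pids, run from a Django shell against the AWX database; this is an illustration, not part of AWX itself):

from django.db import connection

# Lists backends that are currently waiting, together with the PIDs blocking them.
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query, 80)
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for row in cursor.fetchall():
        print(row)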

@wenottingham
Contributor

Can you attach as much of the full log as possible?

How many slices, and how many nodes in each slice?

@jpell958
Author

jpell958 commented Sep 21, 2020

Answer: 273 hosts per slice
inventory_update_247314.txt

There are a few updates since we last reported this issue to the community. We observed that one of the indexes on the main_host table, host_ansible_facts_default_gin, has grown significantly and could be contributing to the slowdown whenever a query hits main_host. We currently have around 4,000 hosts in one of our inventories. During collect-facts job execution, one large UPDATE query is sent for all 4,000 inventory hosts, even when only a subset of them is actually affected. This query takes hours and causes all subsequent queries to queue up behind it; one such job ran for 18+ hours before it had to be manually killed.

We would like to ask the community whether updating ‘last_job_id’ and ‘last_job_host_summary_id’ for all hosts is the expected behavior, even when only a subset of hosts is targeted (e.g. running the collect-facts job against our 4,000-host inventory with a limit).

As a result of a slowdown in database performance, possibly due to a bloated ‘host_ansible_facts_default_gin’ index, we modified the update_host_summary_from_stats method in events.py (under models) to only send an update for hosts that were actually part of the run rather than sending an update for all hosts in the inventory.

This is the original code

for h in all_hosts:
    # if the hostname *shows up* in the playbook_on_stats event
    if h.name in hostnames:
        h.last_job_id = job.id
    if h.id in host_mapping:
        h.last_job_host_summary_id = host_mapping[h.id]
Host.objects.bulk_update(all_hosts, ['last_job_id', 'last_job_host_summary_id'])

This is the modified code (removed unchanged code blocks from the method’s body)

for h in all_hosts:
    # if the hostname *shows up* in the playbook_on_stats event
    host_updated = False
    if h.name in hostnames:
        h.last_job_id = job.id
        host_updated = True
    if h.id in host_mapping:
        h.last_job_host_summary_id = host_mapping[h.id]
        host_updated = True
    if host_updated:
        Host.objects.bulk_update([h], ['last_job_id', 'last_job_host_summary_id'])
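
For comparison, a minimal sketch of an alternative variant (a sketch only, assuming the same all_hosts, hostnames, host_mapping and job variables from update_host_summary_from_stats; this is not the patch above): collect just the hosts that actually changed and issue one batched bulk_update, so no single statement spans the whole inventory.

from awx.main.models import Host  # assumed to already be in scope inside the real method

updated_hosts = []
for h in all_hosts:
    # if the hostname *shows up* in the playbook_on_stats event
    changed = False
    if h.name in hostnames:
        h.last_job_id = job.id
        changed = True
    if h.id in host_mapping:
        h.last_job_host_summary_id = host_mapping[h.id]
        changed = True
    if changed:
        updated_hosts.append(h)

# One multi-row UPDATE per 100 changed hosts, instead of either one giant
# CASE WHEN over the whole inventory or one statement per host.
Host.objects.bulk_update(updated_hosts, ['last_job_id', 'last_job_host_summary_id'], batch_size=100)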

@ryanpetrello
Contributor

ryanpetrello commented Sep 21, 2020

Hey @jpell958,

Thanks for digging into this. Can you check the formatting on your Python code diff above (including indentation)?

As a result of a slowdown in database performance, possibly due to a bloated ‘host_ansible_facts_default_gin’ index, we modified the update_host_summary_from_stats method in events.py (under models) to only send an update for hosts that were actually part of the run rather than sending an update for all hosts in the inventory.

Did your problem go away with this patch?

If you've got a diff that you think resolves this, or improves performance, please feel free to open a pull request so we can talk it over!

@ryanpetrello
Contributor

ryanpetrello commented Sep 22, 2020

@jpell958,

Also, can you try the latest version of AWX? I see that we changed some of this code recently:

#7352

@jpell958
Author

@ryanpetrello,

The changes you are referring to are already accounted for in the version of AWX we are using.

With the code change we made, the massive query that updates ‘last_job_id’ and ‘last_job_host_summary_id’ is now much faster, anecdotally about 3 seconds vs. 2 hours. We will not know the total impact of our change until we place a heavy load on the system starting next month.

@ryanpetrello
Contributor

@jpell958 👍 are you interested in opening a pull request?

@ryanpetrello
Contributor

ryanpetrello commented Sep 22, 2020

Hey @jpell958,

I poked at this and opened a PR which is similar to your suggested diff; I think it should help. Any chance you're interested in giving it a try?

#8206

@jpell958
Author

@ryanpetrello,

At one point we implemented the batch size and limited it to 500 hosts. However, we still observed that the operation took a long time and blocked other transactions, keeping the main_host table locked. We will have a peak load in our environment in about 3 weeks and should be able to report back then on how calling one host.save() at a time holds up.
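
For clarity, the batch-size experiment amounted to something like this (a sketch, assuming the same all_hosts list and field names as in the snippets earlier in the thread):

from awx.main.models import Host

# Same bulk update as before, but split into UPDATE statements of at most 500 hosts each.
# Each chunk is still a multi-row CASE WHEN UPDATE, so rows are still locked in bulk.
Host.objects.bulk_update(all_hosts, ['last_job_id', 'last_job_host_summary_id'], batch_size=500)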

Without the load we cannot replicate this issue. We are working to simulate the load in our lab and will be happy to test your changes then.

One other point: we are observing an increase in our index size of about 500 MB per day. We will continue to monitor to see whether this is related to the issue at hand. At one point the index reached 260 GB and we had to rebuild it.
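
For anyone tracking the same symptom, this is roughly how the index size can be checked from a Django shell (a sketch using standard PostgreSQL size functions; the index name is the one mentioned above):

from django.db import connection

# pg_relation_size() / pg_size_pretty() are standard PostgreSQL functions.
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT pg_size_pretty(pg_relation_size('host_ansible_facts_default_gin'))"
    )
    print(cursor.fetchone()[0])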

@ryanpetrello
Contributor

ryanpetrello commented Sep 22, 2020

It might be that one solution here is to just roll back to distinct updates; this optimization may not be worth it given the behavior you're seeing. That said, in my testing with a 10k host inventory, the ORM cost of distinct .save() calls is really high (as in, it takes several minutes for the playbook_on_stats processing to finish serially).
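
To make the trade-off concrete, a rough sketch of the two approaches being compared (hosts is a hypothetical list of ~10k already-modified Host rows; this is an illustration, not a real benchmark):

import time
from awx.main.models import Host

# Option A: distinct saves -- one small UPDATE per host, low lock footprint,
# but the per-query ORM and round-trip overhead adds up over 10k hosts.
start = time.time()
for h in hosts:
    h.save(update_fields=['last_job_id', 'last_job_host_summary_id'])
print('per-host save: %.1fs' % (time.time() - start))

# Option B: batched bulk_update -- far fewer queries, but each statement is a
# multi-row CASE WHEN UPDATE that locks many rows at once, which is where the
# deadlock risk discussed in this issue comes from.
start = time.time()
Host.objects.bulk_update(hosts, ['last_job_id', 'last_job_host_summary_id'], batch_size=100)
print('bulk_update: %.1fs' % (time.time() - start))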

@one-t
Contributor

one-t commented Oct 7, 2020

Tested this on a large cluster with 2xlarge instances and fact caching enabled; no deadlock was observed.

@one-t one-t closed this as completed Oct 7, 2020
@jpell958
Author

jpell958 commented Oct 7, 2020

Thank you for actioning this request and for the testing efforts.
Please note: January will be the next opportunity for us to test this fix within our organization.
