Database deadlock when running Facts job #8145

Closed
jpell958 opened this issue Sep 15, 2020 · 11 comments

@jpell958

jpell958 commented Sep 15, 2020

ISSUE TYPE
  • Bug Report
SUMMARY

An AWX database deadlock occurs when facts jobs (package facts and gather facts) are running in slices with 4,000 targets.

ENVIRONMENT
  • AWX version: 13.0
  • AWX install method: developed Ansible playbook
  • Ansible version: 2.9.11
  • Operating System: RHEL 7
  • Web Browser: Chromium Edge
STEPS TO REPRODUCE

The issue happens every time we run a playbook with this specific set of tasks:

---
- name: Collect Package Facts
  package_facts:
    manager: "auto"
  tags:
    - gather-facts

- name: Collect All Facts
  setup:
    gather_subset: all
  tags:
    - gather-facts

More details will be added to this issue as they are gathered.

EXPECTED RESULTS
  • No deadlock when running facts job
  • Successfully receive EOF event.
ACTUAL RESULTS
  • Deadlock occurs when running facts job
  • No EOF event received.

Example query used:

UPDATE "main_host" SET "last_job_id" = (CASE WHEN
("main_host"."id" = 8750) THEN 252213 WHEN
("main_host"."id" = 8543) THEN 252212 WHEN
("main_host"."id" = 2245) THEN 252211 WHEN
("main_host"."id" = 2246) THEN 252210 WHEN
("main_host"."id" = 3554) THEN 252268 WHEN
("main_host"."id" = 3553) THEN 252208 WHEN
("main_host"."id" = 3541) THEN 252207 WHEN
("main_host"."id" = 3542)
THEN 252269 WHEN ("main_host"."id" = 428) THEN 252205
WHEN ("main_host"."id" = 651) THEN 252204 WHEN
("main_host"."id" = 10030) THEN 252203 WHEN
("main_host"."id" = 511) THEN 252202 WHEN ("main_host
"."id" = 429) THEN 252213 WHEN ("main_host"."id" = 512)
THEN 252212 WHEN ("main_host"."id" = 294) THEN 252211
WHEN ("main_host"."id" = 295) THEN 252210 WHEN
("main_host"."id" = 296) THEN 252268 WHEN
("main_host"."id" = 335) THEN 252208 WHEN
("main_host"."id" = 430) THEN 252207 WHEN
("main_host"."id" = 539) THEN 252269 WHEN
("main_host"."id" = 297) THEN 252205 WHEN
("main_host"."id" = 720) THEN
252204 WHEN ("main_host"."id" = 710) THEN 252203 WH |
29194 | UPDATE "main_host" SET "last_job_id" = (CASE
WHEN ("main_host"."id" = 8750) THEN 252213 WHEN
("main_host"."id" = 8543) THEN 25221
2 WHEN ("main_host"."id" = 2245) THEN 252211 WHEN
("main_host"."id" = 2246) THEN 252210 WHEN
("main_host"."id" = 3554) THEN 252268 WHEN
("main_host"."id" = 3553) THEN 252208 WHEN
("main_host"."id" =
3541) THEN 252207 WHEN ("main_host"."id" = 3542) THEN
248644 WHEN ("main_host"."id" = 428) THEN 252205 WHEN
("main_host"."id" = 651) THEN 252204 WHEN
("main_host"."id" = 10030) THEN 252203 WHEN ("m
ain_host"."id" = 511) THEN 252202 WHEN
("main_host"."id" = 429) THEN 252213
ADDITIONAL INFORMATION

Screenshot showing job status of Successful, but also showing JOB IS STILL RUNNING

Django traceback excerpt:

django.db.utils.OperationalError: deadlock detected
DETAIL:  Process 1412 waits for ShareLock on transaction 612485542; blocked by process 26582.
Process 26582 waits for ShareLock on transaction 612803038; blocked by process 1412.
HINT:  See server log for query details.
CONTEXT:  while deleting tuple (4244,46) in relation "main_host"
During handling of the above exception, another exception occurred:
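
For reference, a minimal sketch of the kind of query that can be used to see which backends are blocking each other while this is happening (standard pg_stat_activity / pg_blocking_pids, run from a Django shell against the AWX database; this is an illustration, not part of AWX itself):

from django.db import connection

# Lists backends that are currently waiting, together with the PIDs blocking them.
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query, 80)
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for row in cursor.fetchall():
        print(row)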

@wenottingham
Contributor

Can you attach as much of the full log as possible?

How many slices, and how many nodes in each slice?

@jpell958
Author

jpell958 commented Sep 21, 2020

Answer: 273 hosts per slice
inventory_update_247314.txt

There are a few updates since we last reported this issue to the community. We observed that one of the indexes on the main_host table, host_ansible_facts_default_gin, has grown significantly and could be contributing to the slowdown whenever a query hits main_host. We currently have around 4,000 hosts in one of our inventories. During collect-facts job execution, one large UPDATE query is sent for all 4,000 inventory hosts, even when only a subset of them is actually affected. This query takes hours and causes all subsequent queries to queue up behind it; one such job ran for 18+ hours before it had to be manually killed.

We would like to ask the community whether updating ‘last_job_id’ and ‘last_job_host_summary_id’ for all hosts is the expected behavior, even when only a subset of hosts is targeted (e.g. running the collect-facts job against our 4,000-host inventory with a limit).

As a result of a slowdown in database performance, possibly due to a bloated ‘host_ansible_facts_default_gin’ index, we modified the update_host_summary_from_stats method in events.py (under models) to only send an update for hosts that were actually part of the run rather than sending an update for all hosts in the inventory.

This is the original code

for h in all_hosts:
    # if the hostname *shows up* in the playbook_on_stats event
    if h.name in hostnames:
        h.last_job_id = job.id
    if h.id in host_mapping:
        h.last_job_host_summary_id = host_mapping[h.id]
Host.objects.bulk_update(all_hosts, ['last_job_id', 'last_job_host_summary_id'])

This is the modified code (removed unchanged code blocks from the method’s body)

for h in all_hosts:
    # if the hostname *shows up* in the playbook_on_stats event
    host_updated = False
    if h.name in hostnames:
        h.last_job_id = job.id
        host_updated = True
    if h.id in host_mapping:
        h.last_job_host_summary_id = host_mapping[h.id]
        host_updated = True
    if host_updated:
        Host.objects.bulk_update([h], ['last_job_id', 'last_job_host_summary_id'])
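
For comparison, a minimal sketch of an alternative variant (a sketch only, assuming the same all_hosts, hostnames, host_mapping and job variables from update_host_summary_from_stats; this is not the patch above): collect just the hosts that actually changed and issue one batched bulk_update, so no single statement spans the whole inventory.

from awx.main.models import Host  # assumed to already be in scope inside the real method

updated_hosts = []
for h in all_hosts:
    # if the hostname *shows up* in the playbook_on_stats event
    changed = False
    if h.name in hostnames:
        h.last_job_id = job.id
        changed = True
    if h.id in host_mapping:
        h.last_job_host_summary_id = host_mapping[h.id]
        changed = True
    if changed:
        updated_hosts.append(h)

# One multi-row UPDATE per 100 changed hosts, instead of either one giant
# CASE WHEN over the whole inventory or one statement per host.
Host.objects.bulk_update(updated_hosts, ['last_job_id', 'last_job_host_summary_id'], batch_size=100)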

@ryanpetrello
Contributor

ryanpetrello commented Sep 21, 2020

Hey @jpell958,

Thanks for digging into this. Can you check the formatting on your Python code diff above (including indentation)?

As a result of a slowdown in database performance, possibly due to a bloated ‘host_ansible_facts_default_gin’ index, we modified the update_host_summary_from_stats method in events.py (under models) to only send an update for hosts that were actually part of the run rather than sending an update for all hosts in the inventory.

Did your problem go away with this patch?

If you've got a diff that you think resolves this, or improves performance, please feel free to open a pull request so we can talk it over!

@ryanpetrello
Contributor

ryanpetrello commented Sep 22, 2020

@jpell958,

Also, can you try the latest version of AWX? I see that we changed some of this code recently:

#7352

@jpell958
Author

@ryanpetrello,

The changes you are referring to are already accounted for in the version of AWX we are using.

With the code change we made, the massive query that updates ‘last_job_id’ and ‘last_job_host_summary_id’ is now much faster, anecdotally about 3 seconds vs. 2 hours. We will not know the total impact of our change until we place a heavy load on the system starting next month.

@ryanpetrello
Contributor

@jpell958 👍 are you interested in opening a pull request?

@ryanpetrello
Contributor

ryanpetrello commented Sep 22, 2020

Hey @jpell958,

I poked at this and opened a PR which is similar to your suggested diff; I think it should help. Any chance you're interested in giving it a try?

#8206

@jpell958
Author

@ryanpetrello,

At one point we implemented the batch size and limited it to 500 hosts. However, we still observed that the operation took a long time and blocked other transactions, keeping the main_host table locked. We will have a peak load in our environment in about 3 weeks and should be able to report back then on how calling one host.save() at a time holds up.
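
For clarity, the batch-size experiment amounted to something like this (a sketch, assuming the same all_hosts list and field names as in the snippets earlier in the thread):

from awx.main.models import Host

# Same bulk update as before, but split into UPDATE statements of at most 500 hosts each.
# Each chunk is still a multi-row CASE WHEN UPDATE, so rows are still locked in bulk.
Host.objects.bulk_update(all_hosts, ['last_job_id', 'last_job_host_summary_id'], batch_size=500)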

Without the load we cannot replicate this issue. We are working to simulate the load in our lab and will be happy to test your changes then.

One other point: we are observing an increase in our index size of about 500 MB per day. We will continue to monitor to see whether this is related to the issue at hand. At one point the index reached 260 GB and we had to rebuild it.
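
For anyone tracking the same symptom, this is roughly how the index size can be checked from a Django shell (a sketch using standard PostgreSQL size functions; the index name is the one mentioned above):

from django.db import connection

# pg_relation_size() / pg_size_pretty() are standard PostgreSQL functions.
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT pg_size_pretty(pg_relation_size('host_ansible_facts_default_gin'))"
    )
    print(cursor.fetchone()[0])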

@ryanpetrello
Contributor

ryanpetrello commented Sep 22, 2020

It might be that one solution here is to just roll back to distinct updates; this optimization may not be worth it given the behavior you're seeing. That said, in my testing with a 10k host inventory, the ORM cost of distinct .save() calls is really high (as in, it takes several minutes for the playbook_on_stats processing to finish serially).
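
To make the trade-off concrete, a rough sketch of the two approaches being compared (hosts is a hypothetical list of ~10k already-modified Host rows; this is an illustration, not a real benchmark):

import time
from awx.main.models import Host

# Option A: distinct saves -- one small UPDATE per host, low lock footprint,
# but the per-query ORM and round-trip overhead adds up over 10k hosts.
start = time.time()
for h in hosts:
    h.save(update_fields=['last_job_id', 'last_job_host_summary_id'])
print('per-host save: %.1fs' % (time.time() - start))

# Option B: batched bulk_update -- far fewer queries, but each statement is a
# multi-row CASE WHEN UPDATE that locks many rows at once, which is where the
# deadlock risk discussed in this issue comes from.
start = time.time()
Host.objects.bulk_update(hosts, ['last_job_id', 'last_job_host_summary_id'], batch_size=100)
print('bulk_update: %.1fs' % (time.time() - start))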

@one-t
Contributor

one-t commented Oct 7, 2020

Tested this on a large cluster with 2xlarge instances and fact caching enabled; no deadlock was observed.

@one-t one-t closed this as completed Oct 7, 2020
@jpell958
Author

jpell958 commented Oct 7, 2020

Thank you for actioning this request and for the testing efforts.
Please note: January will be the next opportunity for us to test this fix within our organization.
