
Worker reindex can conflict with heartbeat causing a database deadlock #20281

Closed
carbonin opened this issue Jun 16, 2020 · 8 comments · Fixed by #20305

carbonin commented Jun 16, 2020

Originally opened as https://bugzilla.redhat.com/show_bug.cgi?id=1846281

The gist of the problem can be seen in the following postgres log snippet:

2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:LOG:  process 7274 detected deadlock while waiting for AccessExclusiveLock on relation 170417 of database 21747 after 5000.217 ms
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:DETAIL:  Processes holding the lock: 6569, 6657, 7052, 32826, 18539, 42134, 8095, 7900, 8204, 6711, 33341. Wait queue: .
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:STATEMENT:  REINDEX TABLE "miq_workers"
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:ERROR:  deadlock detected
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:DETAIL:  Process 7274 waits for AccessExclusiveLock on relation 170417 of database 21747; blocked by process 6569.
        Process 6569 waits for RowExclusiveLock on relation 28088 of database 21747; blocked by process 7274.
        Process 7274: REINDEX TABLE "miq_workers"
        Process 6569: UPDATE "miq_workers" SET "proportional_set_size" = $1 WHERE "miq_workers"."id" = $2
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:HINT:  See server log for query details.
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:STATEMENT:  REINDEX TABLE "miq_workers"

As the log shows, REINDEX waits for an ACCESS EXCLUSIVE lock, which conflicts with the ROW EXCLUSIVE lock taken by a concurrent UPDATE. We update the miq_workers table so frequently (worker heartbeats) that this deadlock is occurring almost constantly in large environments.

Based on https://www.postgresql.org/docs/10/sql-reindex.html ...

> To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.

But it also states that the index should only need to be rebuilt in pretty rare situations. I think there are really only two options:

  1. Stop reindexing entirely
    • This is something that is an easy change and can also be done by users by just editing settings
    • If there is a performance problem caused by a bloated index, we need to know how to recognize it and resolve it
  2. Change the reindex method to do something more clever like dropping and recreating the index
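For option 2, a non-blocking rebuild might look something like the following. This is only a sketch: the index name and column are hypothetical (not the real miq_workers schema), and note that CREATE INDEX CONCURRENTLY cannot run inside a transaction, so there is a window where the index does not exist at all.

```sql
-- Hypothetical rebuild of one miq_workers index without blocking writers.
-- The index name and column here are illustrative, not the real schema.
DROP INDEX IF EXISTS index_miq_workers_on_queue_name;

-- CONCURRENTLY avoids the ACCESS EXCLUSIVE lock that a plain REINDEX takes,
-- at the cost of a slower build and no transaction-block support.
CREATE INDEX CONCURRENTLY index_miq_workers_on_queue_name
  ON miq_workers (queue_name);
```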

Looking for input and ideas on this from folks that know a bit of the history around why we're doing this kind of maintenance. @Fryguy @kbrock @gtanzillo @jrafanie ?

carbonin self-assigned this Jun 16, 2020
Fryguy commented Jun 16, 2020

What script is doing this? In the past we had a script that tried to be smarter than postgres but I thought we abandoned that.

carbonin (author) commented:

It's not a script. We built reindex and vacuum into schedules that get run by the application.
Here's the PR that added reindex #16929

The schedule and list of tables are configured in settings:

:database:
  :maintenance:
    :reindex_schedule: "1 * * * *"
    :reindex_tables:
    - Metric
    - MiqQueue
    - MiqWorker
carbonin (author) commented:

Also, the deadlock behavior can be easily reproduced by running the following two commands in separate rails console sessions:

# rails console session 1
1000.times { MiqWorker.first.update!(:last_heartbeat => Time.now.utc) }

# rails console session 2
1000.times { MiqWorker.reindex }

jrafanie commented Jun 17, 2020

https://www.postgresql.org/docs/10/routine-reindex.html

"In some situations it is worthwhile to rebuild indexes periodically with the REINDEX command or a series of individual rebuilding steps.

B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted, the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is recommended.

The potential for bloat in non-B-tree indexes has not been well researched. It is a good idea to periodically monitor the index's physical size when using any non-B-tree index type."

I don't know why we're special enough to need periodic reindexing of the miq_workers table, other than exceptional cases where workers are recycling due to errors. Additionally, even if periodic reindexing is needed, I don't think it needs to happen once per hour.

If we're hesitant to cut this out entirely, I'd be fine with changing the default 1 hour to something more reasonable. It doesn't look like we can specify reindex schedules by table so it would have to be the same frequency for metrics, miq_queue and miq_workers, so perhaps every 6 or 12 hours? I don't know how much bloat we get on miq_queue or metrics indexes in 6 or 12 hours.
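One way to answer that question would be to watch index sizes between reindex runs. A sketch using the standard PostgreSQL statistics views; the table names are assumed from the model names above, and what counts as "too much" growth is a judgment call:

```sql
-- Report the on-disk size and scan count of each index on the tables we
-- currently reindex; a size that grows without bound between reindex runs
-- would suggest real bloat worth addressing.
SELECT relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('metrics', 'miq_queue', 'miq_workers')
ORDER BY pg_relation_size(indexrelid) DESC;
```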

carbonin (author) commented:

Yeah, I mean this is failing just about every time in larger environments so I think we should consider removing it entirely if we're pretty sure we don't need it.

carbonin (author) commented:

What I'm saying is by extending the time I think we'll end up deadlocking a worker every 6 or 12 hours rather than every hour.

It's not like workers heartbeat any less frequently when the app isn't being used heavily so we're not going to benefit from doing this in off hours. That's why I listed the only options as stop or fix the reindex 😉

jrafanie commented:

> What I'm saying is by extending the time I think we'll end up deadlocking a worker every 6 or 12 hours rather than every hour.

I agree. I think we should remove this table from the reindex list. If we must keep it though, every hour seems crazy for even a busy installation.

Do we have this problem on the other tables?

carbonin (author) commented:

> Do we have this problem on the other tables?

Not in the environment where I saw this, but REINDEX will always conflict with updates in this way for any table I imagine. We just might not churn quite as frequently on the others.

carbonin added a commit to carbonin/manageiq that referenced this issue Jun 23, 2020
This was causing nearly constant deadlocks in an environment with
lots of updates on the workers table. Based on
https://www.postgresql.org/docs/10/sql-reindex.html it sounds like
reindexes will definitely interfere with updates and that the situations
where we actually need a reindex are rather rare.

https://bugzilla.redhat.com/show_bug.cgi?id=1846281
Fixes ManageIQ#20281