
Worker reindex can conflict with heartbeat causing a database deadlock #20281

Closed
carbonin opened this issue Jun 16, 2020 · 8 comments · Fixed by #20305

carbonin commented Jun 16, 2020

Originally opened as https://bugzilla.redhat.com/show_bug.cgi?id=1846281

The gist of the problem can be seen in the following postgres log snippet:

2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:LOG:  process 7274 detected deadlock while waiting for AccessExclusiveLock on relation 170417 of database 21747 after 5000.217 ms
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:DETAIL:  Processes holding the lock: 6569, 6657, 7052, 32826, 18539, 42134, 8095, 7900, 8204, 6711, 33341. Wait queue: .
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:STATEMENT:  REINDEX TABLE "miq_workers"
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:ERROR:  deadlock detected
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:DETAIL:  Process 7274 waits for AccessExclusiveLock on relation 170417 of database 21747; blocked by process 6569.
        Process 6569 waits for RowExclusiveLock on relation 28088 of database 21747; blocked by process 7274.
        Process 7274: REINDEX TABLE "miq_workers"
        Process 6569: UPDATE "miq_workers" SET "proportional_set_size" = $1 WHERE "miq_workers"."id" = $2
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:HINT:  See server log for query details.
2020-06-15 04:01:08 EDT:(51358):5ee31d37.1c6a:root@vmdb_production:[7274]:STATEMENT:  REINDEX TABLE "miq_workers"

As the log shows, REINDEX waits for an ACCESS EXCLUSIVE lock, which conflicts with the ROW EXCLUSIVE lock taken by a concurrent UPDATE. We update the miq_workers table so frequently (worker heartbeats) that this deadlock is occurring almost constantly in large environments.

Based on https://www.postgresql.org/docs/10/sql-reindex.html ...

> To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.

But it also states that the index should only need to be rebuilt in pretty rare situations. I think there are really only two options:

  1. Stop reindexing entirely
    • This is something that is an easy change and can also be done by users by just editing settings
    • If there is a performance problem caused by a bloated index, we need to know how to recognize it and resolve it
  2. Change the reindex method to do something more clever like dropping and recreating the index
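For option 2, a non-blocking rebuild might look something like the following. This is only a sketch: the index name and column are hypothetical (not the real miq_workers schema), and note that CREATE INDEX CONCURRENTLY cannot run inside a transaction, so there is a window where the index does not exist at all.

```sql
-- Hypothetical rebuild of one miq_workers index without blocking writers.
-- The index name and column here are illustrative, not the real schema.
DROP INDEX IF EXISTS index_miq_workers_on_queue_name;

-- CONCURRENTLY avoids the ACCESS EXCLUSIVE lock that a plain REINDEX takes,
-- at the cost of a slower build and no transaction-block support.
CREATE INDEX CONCURRENTLY index_miq_workers_on_queue_name
  ON miq_workers (queue_name);
```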

Looking for input and ideas on this from folks that know a bit of the history around why we're doing this kind of maintenance. @Fryguy @kbrock @gtanzillo @jrafanie ?

carbonin self-assigned this Jun 16, 2020
Fryguy commented Jun 16, 2020

What script is doing this? In the past we had a script that tried to be smarter than postgres but I thought we abandoned that.

carbonin (author) commented:

It's not a script. We built reindex and vacuum into schedules that get run by the application.
Here's the PR that added reindex #16929

The schedule and list of tables are configured in settings:

:database:
  :maintenance:
    :reindex_schedule: "1 * * * *"
    :reindex_tables:
    - Metric
    - MiqQueue
    - MiqWorker
carbonin (author) commented:

Also, the deadlock behavior can be easily reproduced by running the following two commands in separate rails console sessions:

# rails console session 1
1000.times { MiqWorker.first.update!(:last_heartbeat => Time.now.utc) }

# rails console session 2
1000.times { MiqWorker.reindex }

jrafanie commented Jun 17, 2020

https://www.postgresql.org/docs/10/routine-reindex.html

"In some situations it is worthwhile to rebuild indexes periodically with the REINDEX command or a series of individual rebuilding steps.

B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted, the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is recommended.

The potential for bloat in non-B-tree indexes has not been well researched. It is a good idea to periodically monitor the index's physical size when using any non-B-tree index type."

I don't know why we're special enough to need periodic reindexing of the miq_workers table, other than exceptional cases where workers are recycling due to errors. Additionally, even if periodic reindexing is needed, I don't think it needs to happen once per hour.

If we're hesitant to cut this out entirely, I'd be fine with changing the default 1 hour to something more reasonable. It doesn't look like we can specify reindex schedules by table so it would have to be the same frequency for metrics, miq_queue and miq_workers, so perhaps every 6 or 12 hours? I don't know how much bloat we get on miq_queue or metrics indexes in 6 or 12 hours.
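One way to answer that question would be to watch index sizes between reindex runs. A sketch using the standard PostgreSQL statistics views; the table names are assumed from the model names above, and what counts as "too much" growth is a judgment call:

```sql
-- Report the on-disk size and scan count of each index on the tables we
-- currently reindex; a size that grows without bound between reindex runs
-- would suggest real bloat worth addressing.
SELECT relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('metrics', 'miq_queue', 'miq_workers')
ORDER BY pg_relation_size(indexrelid) DESC;
```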

carbonin (author) commented:

Yeah, I mean this is failing just about every time in larger environments so I think we should consider removing it entirely if we're pretty sure we don't need it.

carbonin (author) commented:

What I'm saying is by extending the time I think we'll end up deadlocking a worker every 6 or 12 hours rather than every hour.

It's not like workers heartbeat any less frequently when the app isn't being used heavily so we're not going to benefit from doing this in off hours. That's why I listed the only options as stop or fix the reindex 😉

jrafanie commented:

> What I'm saying is by extending the time I think we'll end up deadlocking a worker every 6 or 12 hours rather than every hour.

I agree. I think we should remove this table from the reindex list. If we must keep it though, every hour seems crazy for even a busy installation.

Do we have this problem on the other tables?

carbonin (author) commented:

> Do we have this problem on the other tables?

Not in the environment where I saw this, but REINDEX will always conflict with updates in this way for any table I imagine. We just might not churn quite as frequently on the others.

carbonin added a commit to carbonin/manageiq that referenced this issue Jun 23, 2020
This was causing nearly constant deadlocks in an environment with
lots of updates on the workers table. Based on
https://www.postgresql.org/docs/10/sql-reindex.html it sounds like
reindexes will definitely interfere with updates and that the situations
where we actually need a reindex are rather rare.

https://bugzilla.redhat.com/show_bug.cgi?id=1846281
Fixes ManageIQ#20281