Worker reindex can conflict with heartbeat causing a database deadlock #20281
Comments
What script is doing this? In the past we had a script that tried to be smarter than postgres but I thought we abandoned that.
It's not a script. We built reindex and vacuum into schedules that get run by the application. The schedule and list of tables are in the settings (lines 52 to 58 at commit 8473a02).
Also, the deadlock behavior can be easily reproduced by running the following two commands in separate rails console sessions:
1000.times { MiqWorker.first.update!(:last_heartbeat => Time.now.utc) }
1000.times { MiqWorker.reindex }
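For anyone who would rather not juggle two console sessions, the same two loops can probably be raced from a single session with threads. This is only a sketch, not something from the original report: the helper name `race_heartbeat_against_reindex` is made up for illustration, and it assumes the object passed in responds to `first` and `reindex` the way `MiqWorker` does in the commands above.

```ruby
# Hypothetical helper (not part of ManageIQ): runs the heartbeat update loop
# and the reindex loop in parallel threads against the given model.
# Returns a hash with the first error raised by each loop, or nil if the
# loop finished cleanly.
def race_heartbeat_against_reindex(model, iterations: 1000)
  jobs = {
    :heartbeat => -> { model.first.update!(:last_heartbeat => Time.now.utc) },
    :reindex   => -> { model.reindex }
  }

  threads = jobs.transform_values do |job|
    Thread.new do
      begin
        iterations.times { job.call }
        nil # no error in this loop
      rescue StandardError => e
        e   # hand the error back instead of killing the thread silently
      end
    end
  end

  # Thread#value waits for the thread and returns the block's result.
  threads.transform_values(&:value)
end
```

In a rails console this would be called as `race_heartbeat_against_reindex(MiqWorker)`; if the two loops collide, the deadlock error should appear in the returned hash rather than in a second terminal.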
https://www.postgresql.org/docs/10/routine-reindex.html
I don't know why we're special and need periodic reindexing of the miq_workers table, other than in exceptional cases where workers are recycling due to errors. Additionally, even if periodic reindexing is needed, I don't think it's needed once per hour. If we're hesitant to cut this out entirely, I'd be fine with changing the default 1 hour to something more reasonable. It doesn't look like we can specify reindex schedules by table, so it would have to be the same frequency for metrics, miq_queue and miq_workers. Perhaps every 6 or 12 hours? I don't know how much bloat we get on miq_queue or metrics indexes in 6 or 12 hours.
Yeah, I mean this is failing just about every time in larger environments so I think we should consider removing it entirely if we're pretty sure we don't need it.
What I'm saying is that by extending the time, I think we'll end up deadlocking a worker every 6 or 12 hours rather than every hour. It's not like workers heartbeat any less frequently when the app isn't being used heavily, so we're not going to benefit from doing this in off hours. That's why I listed the only options as stop or fix the reindex 😉
I agree. I think we should remove this table from the reindex list. If we must keep it though, every hour seems crazy for even a busy installation. Do we have this problem on the other tables?
Not in the environment where I saw this, but |
This was causing nearly constant deadlocks in an environment with lots of updates on the workers table. Based on https://www.postgresql.org/docs/10/sql-reindex.html it sounds like reindexes will definitely interfere with updates and that the situations where we actually need a reindex are rather rare. https://bugzilla.redhat.com/show_bug.cgi?id=1846281 Fixes ManageIQ#20281
Originally opened as https://bugzilla.redhat.com/show_bug.cgi?id=1846281
The gist of the problem can be seen in the following postgres log snippet:
It seems REINDEX conflicts with concurrent updates to a table, and we update the miq_workers table so frequently that this is occurring almost constantly in large environments.
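To make the conflict a bit more concrete, here is a small sketch of the relevant slice of PostgreSQL's lock conflict matrix. It rests on my understanding of the docs rather than on anything in the log: an UPDATE takes ROW EXCLUSIVE on the table, a plain REINDEX takes SHARE on the parent table (blocking writes but not reads) and ACCESS EXCLUSIVE on the index being rebuilt. The constant and method names are made up for illustration.

```ruby
# Subset of PostgreSQL's lock conflict matrix, limited to the modes
# discussed here. Keys are a held lock mode; values are the requested
# modes that must wait for it.
LOCK_CONFLICTS = {
  "ACCESS SHARE"     => ["ACCESS EXCLUSIVE"],
  "ROW EXCLUSIVE"    => ["SHARE", "ACCESS EXCLUSIVE"],
  "SHARE"            => ["ROW EXCLUSIVE", "ACCESS EXCLUSIVE"],
  "ACCESS EXCLUSIVE" => ["ACCESS SHARE", "ROW EXCLUSIVE", "SHARE", "ACCESS EXCLUSIVE"]
}.freeze

# True when a session requesting `requested` has to wait for a session
# that already holds `held` on the same object.
def lock_conflict?(held, requested)
  LOCK_CONFLICTS.fetch(held).include?(requested)
end

# Heartbeat UPDATE holds ROW EXCLUSIVE on the table; REINDEX wants SHARE.
puts lock_conflict?("ROW EXCLUSIVE", "SHARE")        # => true
# Two concurrent heartbeat UPDATEs do not block each other at table level.
puts lock_conflict?("ROW EXCLUSIVE", "ROW EXCLUSIVE") # => false
```

A conflict on its own only makes one session wait. A deadlock additionally needs each session to be holding a lock the other one is waiting for, which becomes more likely the more often the two operations overlap.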
Based on https://www.postgresql.org/docs/10/sql-reindex.html ...
But it also states that the index should only need to be rebuilt in pretty rare situations. I think there are really only two options:
Looking for input and ideas on this from folks that know a bit of the history around why we're doing this kind of maintenance. @Fryguy @kbrock @gtanzillo @jrafanie ?