
Enable the reindexer by default #718

Merged: bgentry merged 1 commit into master from bg-reindexer on Jan 24, 2025

Conversation

@bgentry (Contributor) commented Jan 13, 2025

Following the questions and discussion in #717, this PR enables the reindexer by default. It was originally disabled in #34 in the lead-up to release because we felt we might want to spend more time addressing potential downsides (like multiple simultaneous reindexings).

Just from looking at the demo app, it's clear that the GIN indexes are particularly prone to long-term bloat. Pulling from what I posted in that discussion, here are the relevant index sizes before and after a manual concurrent reindex call:

Before:

riverdemo=# SELECT
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes;
               index_name               | index_size 
----------------------------------------+------------
 river_job_prioritized_fetching_index   | 80 MB
 river_job_args_index                   | 4013 MB
 river_job_metadata_index               | 645 MB
...
After:

riverdemo=# SELECT
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes;
               index_name               | index_size 
----------------------------------------+------------
 river_job_prioritized_fetching_index   | 38 MB
 river_job_args_index                   | 35 MB
 river_job_metadata_index               | 5656 kB
...

This is likely the reason the Fly deployment where the demo runs keeps running out of disk space on its Postgres instance. There might be situations where river_job_prioritized_fetching_index would also benefit from reindexing, but it's not clear that it needs it to the same extent. Perhaps it would matter with a higher-throughput use case (e.g. Heroku's API per #59).
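
For context, the "manual concurrent reindex call" mentioned above has roughly the following shape. This is a minimal sketch assuming pgx and a local connection string, not River's internal implementation:

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	// Connection string for the demo database is illustrative only.
	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/riverdemo")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// REINDEX INDEX CONCURRENTLY rebuilds each index without blocking writes,
	// reclaiming the GIN bloat shown in the tables above. Running them in a
	// loop also means they execute one at a time rather than simultaneously.
	for _, index := range []string{"river_job_args_index", "river_job_metadata_index"} {
		if _, err := conn.Exec(ctx, "REINDEX INDEX CONCURRENTLY "+index); err != nil {
			log.Fatalf("reindexing %s: %v", index, err)
		}
	}
}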

@bgentry requested a review from @brandur on January 13, 2025 16:41
@brandur (Contributor) left a comment:

A couple thoughts:

  • Two data points on index size tell you something, but when evaluating the value of something like this, it'd be better to know about index growth over time instead (even if it were three data points instead of two). Indexes tend to degrade from their pristine state fairly quickly, but it's not a big problem unless they continue to degrade. In the demo app's case, I wonder if it might grow to hit 80 MB, but then basically become stable there.
  • What about the original concerns about possible slow reindexer calls? Without getting too fancy, it's probably a good idea to put in a check to verify that a reindex for a specific index isn't already running before starting a new one. This could potentially be done with pg_stat_progress_create_index (see the sketch after this list).
    • Along similar lines, looking at the implementation, it would be better if each reindex were kicked off in sequence rather than all simultaneously. This would require augmenting some of the code, but it'd clearly be an improvement.
  • This is definitely in the territory of non-trivial risk, so we should probably have a way of turning it off. I wonder if the package should export a NeverSchedule or something like that which you could pass to ReindexerSchedule to disable it.
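
To make the first check concrete, here's a rough sketch of the kind of guard I have in mind, built on the standard pg_stat_progress_create_index view (Postgres 12+); the package and helper names are purely illustrative, not something already in River:

package reindexguard // hypothetical package, for illustration only

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// ReindexAlreadyRunning reports whether a concurrent reindex is already in
// progress against the river_job table, so that a new one can be skipped.
func ReindexAlreadyRunning(ctx context.Context, conn *pgx.Conn) (bool, error) {
	var running bool
	err := conn.QueryRow(ctx, `
		SELECT count(*) > 0
		FROM pg_stat_progress_create_index
		WHERE relid = 'river_job'::regclass
		  AND command = 'REINDEX CONCURRENTLY'
	`).Scan(&running)
	return running, err
}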

client.go Outdated
// ReindexerIndexes is a list of indexes to reindex on each run of the
// reindexer. If empty, defaults to only the args and metadata GIN indexes
// (river_job_args_index and river_job_metadata_index).
ReindexerIndexes []string
@brandur (Contributor):

Especially given that no one's asking for this, IMO better to leave internal for now. Just improves future flexibility and I can't think of a good reason to have a non-default.

@bgentry (Contributor, Author):

I think the reason would be if you want to add an additional index to the default list, like if you were doing enough throughput that river_job_prioritized_fetching_index was seeing a lot of bloat. But it can definitely go away for now and be re-added later when requested.

CHANGELOG.md Outdated (resolved)
client_test.go Outdated
@@ -3656,6 +3656,17 @@ func Test_Client_Maintenance(t *testing.T) {
})
}

type runOnceScheduler struct {
@brandur (Contributor):

Maybe call this a "schedule" instead of a "scheduler" given the existing naming of types like PeriodicSchedule and DefaultReindexerSchedule.
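
For reference, any such schedule just needs to satisfy PeriodicSchedule's single Next(time.Time) time.Time method. A rough, hypothetical sketch of a run-once schedule (not necessarily what's in client_test.go) could look like:

package rivertestutil // hypothetical package, for illustration only

import (
	"sync/atomic"
	"time"
)

// runOnceSchedule satisfies PeriodicSchedule's Next method: it fires
// immediately on the first call, then pushes every subsequent run far into
// the future so the service effectively never runs again.
type runOnceSchedule struct {
	ran atomic.Bool
}

func (s *runOnceSchedule) Next(now time.Time) time.Time {
	if !s.ran.Swap(true) {
		return now
	}
	return now.Add(100 * 365 * 24 * time.Hour)
}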

client_test.go (resolved)
@bgentry (Contributor, Author) commented Jan 16, 2025

  • This is definitely in the territory of non-trivial risk, so we should probably have a way of turning it off. I wonder if the package should export a NeverSchedule or something like that which you could pass to ReindexerSchedule to disable it.

Hah, I actually meant to add a “never schedule” to this PR but forgot that item before I pushed it. That’s exactly what I was thinking for disabling it. The other option could be to pass a non-nil empty list of indexes, but that seemed a bit more error-prone. Another benefit of the never-schedule design is that we should be able to apply the same approach to other maintenance services to disable them.

Will get that added.

In the demo app's case, I wonder if it might grow to hit 80 MB, but then basically become stable there.

This may not have been clear enough from my original PR, but I wanted to make sure you saw that I did not include the ~80 MB river_job_prioritized_fetching_index in the default list here, specifically because I believe the b-tree indexes are generally quite well maintained by auto-vacuum. The ones I did include are the args and metadata GIN indexes, which shrank from 4013 MB to 35 MB and from 645 MB to 5656 kB respectively after a manual reindex. From what I’ve read, it seems like GIN indexes are much more prone to this issue.

After a few days, you can see these indexes growing steadily while the main btree fetch index is pretty stable:

               index_name               | index_size 
----------------------------------------+------------
 river_job_state_and_finalized_at_index | 26 MB
 river_job_args_index                   | 121 MB
 river_job_metadata_index               | 24 MB

  • What about the original concerns about possible slow reindexer calls? Without getting too fancy, it's probably a good idea to put in a check to verify that a reindex for a specific index isn't already running before starting a new one. This could potentially be done with pg_stat_progress_create_index.

    • Along similar lines, looking at the implementation, it would be better if each reindex were kicked off in sequence rather than all simultaneously. This would require augmenting some of the code, but it'd clearly be an improvement.

I agree these would be improvements, although I'm not certain I'd delay shipping this until we have these optimizations in place. Is that what you'd prefer to do? My thinking is with the default setting of daily reindexes, you'd need to have some serious issues for the reindex to not complete in that timeframe. If you changed them to run every few minutes or seconds, then yeah, it's a lot more risky.

If you think it's important to do something to address these issues before we enable the feature at all, I think we could mitigate most of the risk with just the first option. We could check for ongoing reindexing attempts each time the reindexer is triggered, and exclude those from the run. Staggering the reindex calls (while still maintaining a schedule) sounds a fair bit more complex.

Lmk what you think is appropriate for this stage vs later improvements.

periodic_job.go Outdated
Comment on lines 97 to 100
// ScheduleNever returns a PeriodicSchedule that never runs.
func ScheduleNever() PeriodicSchedule {
	return &neverSchedule{}
}
@bgentry (Contributor, Author):

Naming is a little tough here. neverSchedule is consistent with the internal name periodicIntervalSchedule above, but for the external API name I thought ScheduleNever() makes more sense. If we're going to provide one-off schedule options, they should probably be grouped with a common prefix ScheduleXX for discoverability.

None are consistent with PeriodicInterval's name though.

Let me know what you think.

@brandur (Contributor):

Hrmm, I'm a little mixed on this one. Given that PeriodicSchedule is already an external naming scheme, it may be that NeverSchedule fits better into that, don't you think? "Never schedule" also flows off the tongue a bit easier.

@bgentry (Contributor, Author):

Fair point, renamed to NeverSchedule 👍
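
With the rename in, disabling the reindexer from user code should end up looking roughly like this (a sketch; the driver setup is illustrative and the final config shape could still differ):

package main

import (
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

func main() {
	var dbPool *pgxpool.Pool // assume a pool connected elsewhere

	// Passing NeverSchedule as the ReindexerSchedule opts out of the
	// now-default reindexer maintenance service.
	_, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		ReindexerSchedule: river.NeverSchedule(),
	})
	if err != nil {
		log.Fatal(err)
	}
}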

@bgentry (Contributor, Author) commented Jan 23, 2025

@brandur any thoughts on the remaining questions here? It should be ready to go other than the question on naming.

@brandur (Contributor) commented Jan 23, 2025

After a few days, you can see these indexes growing steadily while the main btree fetch index is pretty stable:

               index_name               | index_size 
----------------------------------------+------------
 river_job_state_and_finalized_at_index | 26 MB
 river_job_args_index                   | 121 MB
 river_job_metadata_index               | 24 MB

Just since I don't think I have access and it's been about a week, can you post these numbers one more time (without reindexing)? I'm curious whether the GIN indexes stabilize.

  • What about the original concerns about possible slow reindexer calls? Without getting too fancy, it's probably a good idea to put in a check to verify that a reindex for a specific index isn't already running before starting a new one. This could potentially be done with pg_stat_progress_create_index.

    • Along similar lines, looking at the implementation, it would be better if each reindex were kicked off in sequence rather than all simultaneously. This would require augmenting some of the code, but it'd clearly be an improvement.

I agree these would be improvements, although I'm not certain I'd delay shipping this until we have these optimizations in place. Is that what you'd prefer to do? My thinking is with the default setting of daily reindexes, you'd need to have some serious issues for the reindex to not complete in that timeframe. If you changed them to run every few minutes or seconds, then yeah, it's a lot more risky.

Alright. Yeah, I guess they can be pushed.

@bgentry (Contributor, Author) commented Jan 24, 2025

Now, one week after the last manual run:

 river_job_args_index                   | 258 MB
 river_job_metadata_index               | 55 MB
 river_job_prioritized_fetching_index   | 74 MB

@bgentry enabled auto-merge (squash) on January 24, 2025 02:32
@bgentry merged commit e296461 into master on Jan 24, 2025
10 checks passed
@bgentry deleted the bg-reindexer branch on January 24, 2025 02:34
@bgentry mentioned this pull request on Jan 28, 2025
@chaporgin commented:
Thank you for this pull request.
In the two comments above (footnotes 1 and 2), as far as I can tell, the size of the two GIN-based indexes was still growing even while the reindexer was running. Or am I reading that wrong?

Footnotes

  1. https://github.com/riverqueue/river/pull/718#issuecomment-2595892059

  2. https://github.com/riverqueue/river/pull/718#issuecomment-2611412578

@bgentry (Contributor, Author) commented Jan 28, 2025

@chaporgin the reindexer was not running regularly in the above statistics queries—those are from the demo app running v0.15.0 before enabling the reindexer, showing what happens when I then run the reindexer queries manually.
