
Enable the reindexer by default #718

Merged: bgentry merged 1 commit into master from bg-reindexer on Jan 24, 2025

Conversation

@bgentry (Contributor) commented Jan 13, 2025

Following the questions and discussion in #717, this PR enables the reindexer by default. It was originally disabled in #34 in the lead-up to release because we felt we might want to spend more time addressing potential downsides (like multiple simultaneous reindexings).

Just from looking at the demo app, it's clear that the GIN indexes are particularly prone to long-term bloat. Pulling from what I posted in that discussion, here are the relevant index sizes before and after a manual concurrent reindex call:

Before:

riverdemo=# SELECT
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes;
               index_name               | index_size 
----------------------------------------+------------
 river_job_prioritized_fetching_index   | 80 MB
 river_job_args_index                   | 4013 MB
 river_job_metadata_index               | 645 MB
...
After:

riverdemo=# SELECT
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes;
               index_name               | index_size 
----------------------------------------+------------
 river_job_prioritized_fetching_index   | 38 MB
 river_job_args_index                   | 35 MB
 river_job_metadata_index               | 5656 kB
...

This is likely the reason the Fly deployment where the demo runs keeps running out of disk space on its Postgres instance. There might be situations where river_job_prioritized_fetching_index would also benefit from reindexing, but it's not clear that it needs it to the same extent. Perhaps it would matter with a higher-throughput use case (e.g. Heroku's API per #59).
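
For context, the "manual concurrent reindex call" mentioned above has roughly the following shape. This is a minimal sketch assuming pgx and a local connection string, not River's internal implementation:

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	// Connection string for the demo database is illustrative only.
	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/riverdemo")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// REINDEX INDEX CONCURRENTLY rebuilds each index without blocking writes,
	// reclaiming the GIN bloat shown in the tables above. Running them in a
	// loop also means they execute one at a time rather than simultaneously.
	for _, index := range []string{"river_job_args_index", "river_job_metadata_index"} {
		if _, err := conn.Exec(ctx, "REINDEX INDEX CONCURRENTLY "+index); err != nil {
			log.Fatalf("reindexing %s: %v", index, err)
		}
	}
}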

@bgentry requested a review from @brandur on January 13, 2025 16:41
@brandur (Contributor) left a comment:

A couple thoughts:

  • Two data points on index size tell you something, but when evaluating the value of something like this, it'd be better to know about index growth over time instead (even if it were three data points instead of two). Indexes tend to degrade from their pristine state fairly quickly, but it's not a big problem unless they continue to degrade. In the demo app's case, I wonder if it might grow to hit 80 MB, but then basically become stable there.
  • What about the original concerns about possible slow reindexer calls? Without getting too fancy, it's probably a good idea to put in a check to verify that a reindex for a specific index isn't already running before starting a new one. This could potentially be done with pg_stat_progress_create_index (see the sketch after this list).
    • Along similar lines, looking at the implementation, it would be better if each reindex were kicked off in sequence rather than all simultaneously. This would require augmenting some of the code, but it'd clearly be an improvement.
  • This is definitely in the territory of non-trivial risk, so we should probably have a way of turning it off. I wonder if the package should export a NeverSchedule or something like that which you could pass to ReindexerSchedule to disable it.
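
To make the first check concrete, here's a rough sketch of the kind of guard I have in mind, built on the standard pg_stat_progress_create_index view (Postgres 12+); the package and helper names are purely illustrative, not something already in River:

package reindexguard // hypothetical package, for illustration only

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// ReindexAlreadyRunning reports whether a concurrent reindex is already in
// progress against the river_job table, so that a new one can be skipped.
func ReindexAlreadyRunning(ctx context.Context, conn *pgx.Conn) (bool, error) {
	var running bool
	err := conn.QueryRow(ctx, `
		SELECT count(*) > 0
		FROM pg_stat_progress_create_index
		WHERE relid = 'river_job'::regclass
		  AND command = 'REINDEX CONCURRENTLY'
	`).Scan(&running)
	return running, err
}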

client.go Outdated
// ReindexerIndexes is a list of indexes to reindex on each run of the
// reindexer. If empty, defaults to only the args and metadata GIN indexes
// (river_job_args_index and river_job_metadata_index).
ReindexerIndexes []string
@brandur (Contributor):

Especially given that no one's asking for this, IMO better to leave internal for now. Just improves future flexibility and I can't think of a good reason to have a non-default.

@bgentry (Contributor, Author):

I think the reason would be if you want to add an additional index to the default list, like if you were doing enough throughput that river_job_prioritized_fetching_index was seeing a lot of bloat. But it can definitely go away for now and be re-added later when requested.

CHANGELOG.md Outdated (resolved)
client_test.go Outdated
@@ -3656,6 +3656,17 @@ func Test_Client_Maintenance(t *testing.T) {
})
}

type runOnceScheduler struct {
@brandur (Contributor):

Maybe call this a "schedule" instead of a "scheduler" given the existing naming of types like PeriodicSchedule and DefaultReindexerSchedule.
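
For reference, any such schedule just needs to satisfy PeriodicSchedule's single Next(time.Time) time.Time method. A rough, hypothetical sketch of a run-once schedule (not necessarily what's in client_test.go) could look like:

package rivertestutil // hypothetical package, for illustration only

import (
	"sync/atomic"
	"time"
)

// runOnceSchedule satisfies PeriodicSchedule's Next method: it fires
// immediately on the first call, then pushes every subsequent run far into
// the future so the service effectively never runs again.
type runOnceSchedule struct {
	ran atomic.Bool
}

func (s *runOnceSchedule) Next(now time.Time) time.Time {
	if !s.ran.Swap(true) {
		return now
	}
	return now.Add(100 * 365 * 24 * time.Hour)
}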

client_test.go (resolved)
@bgentry (Contributor, Author) commented Jan 16, 2025

  • This is definitely in the territory of non-trivial risk, so we should probably have a way of turning it off. I wonder if the package should export a NeverSchedule or something like that which you could pass to ReindexerSchedule to disable it.

Hah, I actually meant to add a “never schedule” to this PR but forgot that item before I pushed it. That’s exactly what I was thinking for disabling it. The other option could be to pass a non-nil empty list of indexes, but that seemed a bit more error-prone. Another benefit of the never-schedule design is that we should be able to apply the same approach to other maintenance services to disable them.

Will get that added.

In the demo app's case, I wonder if it might grow to hit 80 MB, but then basically become stable there.

This may not have been clear enough from my original PR, but I wanted to make sure you saw that I did not include the ~80 MB river_job_prioritized_fetching_index in the default list here, specifically because I believe the b-tree indexes are generally quite well maintained by auto-vacuum. The ones I did include are the args and metadata GIN indexes, which shrank from 4013 MB to 35 MB and from 645 MB to 5656 kB respectively after a manual reindex. From what I’ve read, it seems like GIN indexes are much more prone to this issue.

After a few days, you can see these indexes growing steadily while the main btree fetch index is pretty stable:

               index_name               | index_size 
----------------------------------------+------------
 river_job_state_and_finalized_at_index | 26 MB
 river_job_args_index                   | 121 MB
 river_job_metadata_index               | 24 MB

  • What about the original concerns about possible slow reindexer calls? Without getting too fancy, it's probably a good idea to put in a check to verify that a reindex for a specific index isn't already running before starting a new one. This could potentially be done with pg_stat_progress_create_index.

    • Along similar lines, looking at the implementation, it would be better if each reindex were kicked off in sequence rather than all simultaneously. This would require augmenting some of the code, but it'd clearly be an improvement.

I agree these would be improvements, although I'm not certain I'd delay shipping this until we have these optimizations in place. Is that what you'd prefer to do? My thinking is with the default setting of daily reindexes, you'd need to have some serious issues for the reindex to not complete in that timeframe. If you changed them to run every few minutes or seconds, then yeah, it's a lot more risky.

If you think it's important to do something to address these issues before we enable the feature at all, I think we could mitigate most of the risk with just the first option. We could check for ongoing reindexing attempts each time the reindexer is triggered, and exclude those from the run. Staggering the reindex calls (while still maintaining a schedule) sounds a fair bit more complex.

Lmk what you think is appropriate for this stage vs later improvements.

periodic_job.go Outdated
Comment on lines 97 to 100
// ScheduleNever returns a PeriodicSchedule that never runs.
func ScheduleNever() PeriodicSchedule {
	return &neverSchedule{}
}
@bgentry (Contributor, Author):

Naming is a little tough here. neverSchedule is consistent with the internal name periodicIntervalSchedule above, but for the external API name I thought ScheduleNever() makes more sense. If we're going to provide one-off schedule options, they should probably be grouped with a common prefix ScheduleXX for discoverability.

None are consistent with PeriodicInterval's name though.

Let me know what you think.

@brandur (Contributor):

Hrmm, I'm a little mixed on this one. Given that PeriodicSchedule is already an external naming scheme, it may be that NeverSchedule fits better into that, don't you think? "Never schedule" also flows off the tongue a bit easier.

@bgentry (Contributor, Author):

Fair point, renamed to NeverSchedule 👍
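
With the rename in, disabling the reindexer from user code should end up looking roughly like this (a sketch; the driver setup is illustrative and the final config shape could still differ):

package main

import (
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

func main() {
	var dbPool *pgxpool.Pool // assume a pool connected elsewhere

	// Passing NeverSchedule as the ReindexerSchedule opts out of the
	// now-default reindexer maintenance service.
	_, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		ReindexerSchedule: river.NeverSchedule(),
	})
	if err != nil {
		log.Fatal(err)
	}
}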

@bgentry (Contributor, Author) commented Jan 23, 2025

@brandur any thoughts on the remaining questions here? It should be ready to go other than the question on naming.

@brandur (Contributor) commented Jan 23, 2025

After a few days, you can see these indexes growing steadily while the main btree fetch index is pretty stable:

               index_name               | index_size 
----------------------------------------+------------
 river_job_state_and_finalized_at_index | 26 MB
 river_job_args_index                   | 121 MB
 river_job_metadata_index               | 24 MB

Just since I don't think I have access and it's been about a week, can you post these numbers one more time (without reindexing)? I'm curious whether the GIN indexes stabilize.

  • What about the original concerns about possible slow reindexer calls? Without getting too fancy, it's probably a good idea to put in a check to verify that a reindex for a specific index isn't already running before starting a new one. This could potentially be done with pg_stat_progress_create_index.

    • Along similar lines, looking at the implementation, it would be better if each reindex were kicked off in sequence rather than all simultaneously. This would require augmenting some of the code, but it'd clearly be an improvement.

I agree these would be improvements, although I'm not certain I'd delay shipping this until we have these optimizations in place. Is that what you'd prefer to do? My thinking is with the default setting of daily reindexes, you'd need to have some serious issues for the reindex to not complete in that timeframe. If you changed them to run every few minutes or seconds, then yeah, it's a lot more risky.

Alright. Yeah, I guess they can be pushed.

@bgentry (Contributor, Author) commented Jan 24, 2025

Now, one week after the last manual run:

 river_job_args_index                   | 258 MB
 river_job_metadata_index               | 55 MB
 river_job_prioritized_fetching_index   | 74 MB

@bgentry enabled auto-merge (squash) on January 24, 2025 02:32
@bgentry merged commit e296461 into master on Jan 24, 2025
10 checks passed
@bgentry deleted the bg-reindexer branch on January 24, 2025 02:34
@bgentry mentioned this pull request on Jan 28, 2025
@chaporgin commented:
Thank you for this pull request.
In the two comments above (footnotes 1 and 2), as far as I can tell, the size of the two GIN-based indexes was still growing even while the reindexer was running. Or am I reading that wrong?

Footnotes

  1. https://github.com/riverqueue/river/pull/718#issuecomment-2595892059

  2. https://github.com/riverqueue/river/pull/718#issuecomment-2611412578

@bgentry (Contributor, Author) commented Jan 28, 2025

@chaporgin the reindexer was not running regularly in the above statistics queries—those are from the demo app running v0.15.0 before enabling the reindexer, showing what happens when I then run the reindexer queries manually.
