API-side query caching for job counts #108

Merged · 1 commit merged into master on Aug 13, 2024

Conversation

@brandur (Collaborator) commented Aug 7, 2024

This one's related to trying to find a solution for #106. After messing
around with query plans a lot on the "count by state" query, I came to
the conclusion in the end that Postgres might actually be doing the
right thing by falling back to a sequential scan, or at least only the
minimally wrong thing. Even forcing the count query to run against a
well-used index is a fairly slow operation when there are many jobs in
the database. It's hard to provide specifics because caching affects the
result so much (so running the same query twice in a row can produce
vastly different timings), but I've seen the index version take longer
than the seq scan in some cases.

So here, I'm proposing a radically different solution in which we add
some infrastructure to the River UI API server that lets it run slow
queries periodically in the background, then have API endpoints take
advantage of those cached results instead of having to run each
operation themselves, thereby making their responses ~instant.

I've written it such that this caching only kicks in when we know we're
working with a very large data set where it actually matters (currently
defined as > 1M rows), with the idea being that for smaller databases
we'll continue to run queries in-band so that results look as fresh and
real-time as possible.
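
The in-band versus cached decision could look roughly like this at the endpoint level (everything here is illustrative: the row type, the threshold constant, and the helper functions are stand-ins, not riverui's real code):

```go
package server

import "context"

// JobCountByStateRow stands in for the sqlc-generated row type
// (dbsqlc.JobCountByStateRow in the logs); illustrative only.
type JobCountByStateRow struct {
	State string
	Count int64
}

// cacheThreshold approximates the ">1M rows" cutoff described above.
const cacheThreshold = 1_000_000

// countsByState picks between the cached result and a fresh in-band query.
// The three function parameters stand in for the real dependencies: a cheap
// row estimate, the query cache, and the direct sqlc query.
func countsByState(
	ctx context.Context,
	estimateRows func(context.Context) (int64, error),
	cachedCounts func() ([]*JobCountByStateRow, bool),
	queryCounts func(context.Context) ([]*JobCountByStateRow, error),
) ([]*JobCountByStateRow, error) {
	estimate, err := estimateRows(ctx)
	if err != nil {
		return nil, err
	}

	if estimate > cacheThreshold {
		if counts, ok := cachedCounts(); ok {
			return counts, nil // ~instant: serve the background-cached result
		}
		// No cached result yet (e.g. immediately after startup): fall through
		// and run the query directly, accepting that it may be slow.
	}

	// Smaller database, or cache not warmed yet: run the query in-band so the
	// counts look as fresh and real-time as possible.
	return queryCounts(ctx)
}
```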

To support this, I've had to make some changes to the River UI API
server/handler so that it has a Start function that can be invoked to
start background utilities like the query cache. It's a considerable
change, but I think it leaves us in a more sustainable place API-wise
because we may want to add other background utilities later on, and
returning an http.Handler isn't enough because even if you were to
start goroutines from NewHandler, it's far from ideal that there's no
way to stop those goroutines again (problematic for anything that wants
to check for leaks with goleak).
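
Hedged sketch of the intended lifecycle from an embedding program's point of view; the `NewServer`/`ServerOpts`/`Handler` names are assumptions based on the paragraph above rather than a documented API:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/riverqueue/riverui"
)

func main() {
	// Cancelling this context is what stops the background goroutines again,
	// which is what makes leak checking with goleak workable in tests.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Constructor, options, and the Handler accessor are assumptions based on
	// the description above, not a documented API.
	server, err := riverui.NewServer(&riverui.ServerOpts{ /* DB pool, logger, ... */ })
	if err != nil {
		log.Fatal(err)
	}

	// Start launches background utilities like the query cache; it does not
	// itself start listening for HTTP.
	if err := server.Start(ctx); err != nil {
		log.Fatal(err)
	}

	log.Fatal(http.ListenAndServe(":8080", server.Handler()))
}
```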

I'm also going to propose that we increase the default API endpoint
timeout from 5 seconds to 10 seconds. When I load in 3 to 5 million job
rows, I see count queries taking right around that same 3 to 5 second range.
Since the original number of 5 seconds was a little arbitrary anyway, it
can't hurt to give those queries a little more leeway. A problem that
could still occur even with my proposal here is that if a user starts
River UI and then immediately hits the UI, there won't be a cached
result yet, and therefore the count query will go to the database
directly, and that may still cause a timeout at 5 seconds.
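
For reference, one way a 10-second cap could be enforced with just the standard library (the real project may do this differently, e.g. with a per-request context deadline rather than `http.TimeoutHandler`):

```go
package server

import (
	"net/http"
	"time"
)

// withTimeout caps how long an endpoint may run. The 10s value mirrors the
// proposed new default; how the real project enforces the timeout may differ.
func withTimeout(h http.Handler) http.Handler {
	return http.TimeoutHandler(h, 10*time.Second, "request timed out")
}
```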

I've only applied caching to the count query so far, but I've written
the QueryCacher code such that it can cleanly support other queries if
we care to add them.
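
Continuing the hypothetical `querycache` sketch above (same package, same imports), wiring up a second cached query would just be another instantiation; `QueueCountRow` and the query function are invented stand-ins, not part of riverui:

```go
// QueueCountRow is a made-up row type for some other slow aggregate.
type QueueCountRow struct {
	Queue string
	Count int64
}

// startQueueCountCache shows a second use of the same generic machinery.
func startQueueCountCache(ctx context.Context, logger *slog.Logger,
	queryQueueCounts func(context.Context) ([]*QueueCountRow, error),
) *QueryCacher[[]*QueueCountRow] {
	cache := New(logger, queryQueueCounts)
	go cache.Start(ctx, 10*time.Second) // runs until ctx is cancelled
	return cache
}
```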

@brandur force-pushed the brandur-query-cache branch from 7b11868 to 6baf4d6 on August 7, 2024 00:43
@brandur requested a review from bgentry on August 7, 2024 00:43
@brandur (Collaborator, Author) commented Aug 7, 2024

@bgentry Wanted to get your rough thoughts on this approach. Here's what the service looks like in action:

$ go build ./cmd/riverui && RIVER_DEBUG=true ./riverui
2024/08/06 17:43:13 starting server on :8080
time=2024-08-06T17:43:24.306-07:00 level=DEBUG msg="QueryCacher[[]*dbsqlc.JobCountByStateRow]: Ran query and cached result" duration=38.686833ms tick_period=10.298205226s
time=2024-08-06T17:43:34.566-07:00 level=DEBUG msg="QueryCacher[[]*dbsqlc.JobCountByStateRow]: Ran query and cached result" duration=888.291µs tick_period=10.298205226s
time=2024-08-06T17:43:44.864-07:00 level=DEBUG msg="QueryCacher[[]*dbsqlc.JobCountByStateRow]: Ran query and cached result" duration=806µs tick_period=10.298205226s

@bgentry (Contributor) left a comment

Nice! Looking forward to giving this a shot.

The demo app is currently just below the threshold that would trigger this, wondering if we should nudge up its job frequency so that it's above the threshold 🤔

go.mod (Outdated)

@@ -1,6 +1,6 @@
 module github.com/riverqueue/riverui

-go 1.22
+go 1.22.5

@bgentry (Contributor):

need to make tweaks to this based on what we've just learned, right?

@brandur (Collaborator, Author):

Changed it to 1.22. We have to keep this one on 1.22+ though because we're using Go 1.22's new HTTP routing features.
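
(For context, the Go 1.22 routing features in question are the method-aware, wildcard-aware `ServeMux` patterns; the route below is purely illustrative, not a real River UI route.)

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Go 1.22's ServeMux accepts an HTTP method prefix and path wildcards in
	// patterns, and r.PathValue reads the matched wildcard value.
	mux.HandleFunc("GET /api/jobs/{id}", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "job %s\n", r.PathValue("id"))
	})
	_ = http.ListenAndServe(":8080", mux)
}
```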

// Start starts the server's background services. Notably, this does _not_ cause
// the server to start listening for HTTP in any way. To do that, call Handler
// and mount or run it using Go's built in `net/http`.
func (s *Server) Start(ctx context.Context) error {
@bgentry (Contributor):

I probably need some structure like this in the API & demo apps too 🤔

Comment on lines -2 to -11
SELECT
state,
count(*)
FROM
river_job
WHERE
queue IS NOT NULL AND
priority > 0 AND
scheduled_at IS NOT NULL AND
id IS NOT NULL
@bgentry (Contributor):

I think this was originally in here in an attempt to nudge Postgres into taking advantage of the prioritized fetching index for this query, but tbh it may not have been the best way to do that. I assume you've ensured the index still gets used on large tables even after this change?

@brandur (Collaborator, Author) commented Aug 13, 2024:

Oh no, the index definitely does not get used (unless you reindex), and that's what's driving this whole change. A seq scan was reported in #106, and it turns out that the seq scan is mostly right because an index scan is approximately equivalent in cost, so Postgres' decision is correct. The NOT NULL conditions are already in the released version of the query, and the seq scan happens either way.
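
For anyone who wants to double-check the planner's choice on a large table, a small sketch of running `EXPLAIN` from Go via pgx (the SQL text is only an approximation of the count-by-state query, not copied from the repository):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/jackc/pgx/v5"
)

// Prints the plan Postgres picks for an approximate count-by-state query so
// the seq scan vs. index scan decision can be inspected directly.
func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	defer conn.Close(ctx)

	rows, err := conn.Query(ctx, "EXPLAIN SELECT state, count(*) FROM river_job GROUP BY state")
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```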

return nil
}

// Simplies the name of a Go type that uses generics for cleaner logging output.
@bgentry (Contributor):

Suggested change:
-// Simplies the name of a Go type that uses generics for cleaner logging output.
+// Simplifies the name of a Go type that uses generics for cleaner logging output.

@brandur (Collaborator, Author):

thx.

@brandur force-pushed the brandur-query-cache branch 5 times, most recently from c00c3cf to aac7a3e on August 13, 2024 02:59
@brandur (Collaborator, Author) commented Aug 13, 2024

> The demo app is currently just below the threshold that would trigger this, wondering if we should nudge up its job frequency so that it's above the threshold 🤔

Yes! Totally down for this. It'd probably help suss out other performance problems too. I guess if that's hard we could also consider reducing the 1M threshold to 750k since it's somewhat arbitrary anyway.

@brandur force-pushed the brandur-query-cache branch from aac7a3e to 82b4b45 on August 13, 2024 03:07
@brandur (Collaborator, Author) commented Aug 13, 2024

Thanks! Addressed comments and also added a changelog.

@brandur merged commit 7088258 into master on Aug 13, 2024 · 10 checks passed
@brandur deleted the brandur-query-cache branch on August 13, 2024 03:10