API-side query caching for job counts #108
Conversation
Force-pushed from 7b11868 to 6baf4d6.
@bgentry Wanted to get your rough thoughts on this approach. Here's what the service looks like in action:
[output omitted]
Nice! Looking forward to giving this a shot.
The demo app is currently just below the threshold that would trigger this, wondering if we should nudge up its job frequency so that it's above the threshold 🤔
go.mod (outdated):

```diff
@@ -1,6 +1,6 @@
 module github.com/riverqueue/riverui
 
-go 1.22
+go 1.22.5
```
need to make tweaks to this based on what we've just learned, right?
Changed it to 1.22. We have to keep this one on 1.22+ though because we're using Go 1.22's new HTTP routing features.
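For anyone following along, this is the kind of thing meant by Go 1.22's routing features: method matching and path wildcards in `net/http.ServeMux` patterns. A minimal standalone illustration (not code from this repo):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Method matching ("GET ") and the {id} wildcard in ServeMux
	// patterns are both new in Go 1.22; r.PathValue was added at the
	// same time to read wildcard values.
	mux.HandleFunc("GET /api/jobs/{id}", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "job %s\n", r.PathValue("id"))
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```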
```go
// Start starts the server's background services. Notably, this does _not_ cause
// the server to start listening for HTTP in any way. To do that, call Handler
// and mount or run it using Go's built in `net/http`.
func (s *Server) Start(ctx context.Context) error {
```
I probably need some structure like this in the API & demo apps too 🤔
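As a rough sketch of the lifecycle being discussed here: `Start` and `Handler` come from the excerpt above, while the `NewServer` constructor and the `:8080` address are made up for illustration.

```go
package main

import (
	"context"
	"log"
	"net/http"
)

func main() {
	// Canceling this context is what stops the background services
	// again, which is the property goleak-style leak checks need.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	server := NewServer() // hypothetical constructor for this sketch

	// Start launches background services only; it doesn't listen for HTTP.
	if err := server.Start(ctx); err != nil {
		log.Fatal(err)
	}

	// Mount the handler ourselves, per the doc comment on Start.
	log.Fatal(http.ListenAndServe(":8080", server.Handler()))
}
```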
```sql
SELECT
    state,
    count(*)
FROM
    river_job
WHERE
    queue IS NOT NULL AND
    priority > 0 AND
    scheduled_at IS NOT NULL AND
    id IS NOT NULL
```
I think this was originally in here in an attempt to nudge Postgres into taking advantage of the prioritized fetching index for this query, but tbh it may not have been the best way to do that. I assume you've ensured the index still gets used on large tables even after this change?
Oh no, the index definitely does not get used (unless you reindex) — that's what's driving this whole change. A seq scan was reported in #106, and it turns out the seq scan is mostly right because an index scan is approximately equivalent in cost, so Postgres' decision is correct. The `NOT NULL` predicates already exist in the current release, and the seq scan happens either way.
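For reference, one way to double-check which plan Postgres picks on a given database is to run `EXPLAIN` against the count query, e.g. via pgx. This is illustrative only; the query here is an approximation of the one above, and the visible diff cuts off before any `GROUP BY`, so that part is assumed.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// EXPLAIN returns one text row per plan node, which makes the
	// seq-scan-vs-index-scan choice easy to eyeball.
	rows, err := conn.Query(ctx,
		"EXPLAIN SELECT state, count(*) FROM river_job GROUP BY state")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			log.Fatal(err)
		}
		fmt.Println(line)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```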
internal/querycacher/query_cacher.go (outdated):
```go
	return nil
}

// Simplies the name of a Go type that uses generics for cleaner logging output.
```
Suggested change:

```diff
-// Simplies the name of a Go type that uses generics for cleaner logging output.
+// Simplifies the name of a Go type that uses generics for cleaner logging output.
```
thx.
Force-pushed from c00c3cf to aac7a3e.
Yes! Totally down for this. It'd probably help suss out other performance problems too. I guess if that's hard we could also consider reducing the 1M threshold to 750k since it's somewhat arbitrary anyway.
Force-pushed from aac7a3e to 82b4b45.
Thanks! Addressed comments and also added a changelog.
This one's related to trying to find a solution for #106. After messing
around with query plans a lot on the "count by state" query, I came to
the conclusion in the end that Postgres might actually be doing the
right thing by falling back to a sequential scan, or at least only the
minimally wrong thing. Even forcing the count query to run against a
well-used index is a fairly slow operation when there are many jobs in
the database. It's hard to provide specifics because caching affects the
result so much (so running the same query twice in a row can produce
vastly different timings), but I've seen the index version take longer
than the seq scan in some cases.
So here, I'm proposing a radically different solution in which we add
some infrastructure to the River UI API server that lets it run slow
queries periodically in the background, then have API endpoints take
advantage of those cached results instead of having to run each
operation themselves, thereby making their responses ~instant.
I've written it such that this caching only kicks in when we know we're
working with a very large data set where it actually matters (currently
defined as > 1M rows), with the idea being that for smaller databases
we'll continue to run queries in-band so that results look as fresh and
real-time as possible.
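To make the idea concrete, here's a condensed sketch of what a periodic query cacher along these lines could look like. Everything beyond the `QueryCacher` name (which matches `internal/querycacher/query_cacher.go` above) is hypothetical, not the PR's actual API:

```go
package querycacher

import (
	"context"
	"sync"
	"time"
)

// QueryCacher periodically runs a slow query in the background and
// caches its most recent result. Illustrative sketch only.
type QueryCacher[T any] struct {
	mu       sync.RWMutex
	res      T
	haveRes  bool
	runQuery func(ctx context.Context) (T, error)
}

func New[T any](runQuery func(ctx context.Context) (T, error)) *QueryCacher[T] {
	return &QueryCacher[T]{runQuery: runQuery}
}

// Run refreshes the cached result on an interval until ctx is
// canceled, which is what lets callers stop the goroutine again.
func (c *QueryCacher[T]) Run(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second) // interval is a made-up value
	defer ticker.Stop()

	for {
		if res, err := c.runQuery(ctx); err == nil {
			c.mu.Lock()
			c.res, c.haveRes = res, true
			c.mu.Unlock()
		}

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// CachedRes returns the most recent result, if one exists. Handlers
// would fall back to an in-band query when it doesn't, e.g. below the
// ~1M row threshold, or right after startup.
func (c *QueryCacher[T]) CachedRes() (T, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.res, c.haveRes
}
```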
To support this, I've had to make some changes to the River UI API
server/handler so that it has a `Start` function that can be invoked to
start background utilities like the query cache. It's a considerable
change, but I think it leaves us in a more sustainable place API-wise
because we may want to add other background utilities later on, and
returning an `http.Handler` isn't enough: even if you were to start
goroutines from `NewHandler`, it's far from ideal that there's no way
to stop those goroutines again (problematic for anything that wants to
check for leaks with goleak).
I'm also going to propose that we increase the default API endpoint
timeout from 5 seconds to 10 seconds. When I load in 3 to 5 million job
rows, I see count queries taking right around that 3 to 5 seconds range.
Since the original number of 5 seconds was a little arbitrary anyway, it
can't hurt to give those queries a little more leeway. A problem that
could still occur even with my proposal here is that if a user starts
River UI and then immediately hits the UI, there won't be a cached
result yet, and therefore the count query will go to the database
directly, which may still cause a timeout at 5 seconds.
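For illustration, a 10-second per-request budget could be applied with a context deadline in middleware, roughly like this (not necessarily how the PR does it):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// withTimeout applies a 10s deadline to each request's context so that
// queries issued with r.Context() are canceled when the budget expires.
func withTimeout(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	mux := http.NewServeMux() // stand-in for the real API handler
	log.Fatal(http.ListenAndServe(":8080", withTimeout(mux)))
}
```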
I've only applied caching to the count query so far, but I've written
the `QueryCacher` code such that it can cleanly support other queries
if we care to add them.