API-side query caching for job counts #108

Merged · 1 commit merged into master on Aug 13, 2024

Conversation

@brandur (Collaborator) commented Aug 7, 2024

This one's related to trying to find a solution for #106. After messing
around with query plans a lot on the "count by state" query, I came to
the conclusion in the end that Postgres might actually be doing the
right thing by falling back to a sequential scan, or at least only the
minimally wrong thing. Even forcing the count query to run against a
well-used index is a fairly slow operation when there are many jobs in
the database. It's hard to provide specifics because caching affects the
result so much (so running the same query twice in a row can produce
vastly different timings), but I've seen the index version take longer
than the seq scan in some cases.

So here, I'm proposing a radically different solution in which we add
some infrastructure to the River UI API server that lets it run slow
queries periodically in the background, then have API endpoints take
advantage of those cached results instead of having to run each
operation themselves, thereby making their responses ~instant.

I've written it such that this caching only kicks in when we know we're
working with a very large data set where it actually matters (currently
defined as > 1M rows), with the idea being that for smaller databases
we'll continue to run queries in-band so that results look as fresh and
real-time as possible.
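
The in-band versus cached decision could look roughly like this at the endpoint level (everything here is illustrative: the row type, the threshold constant, and the helper functions are stand-ins, not riverui's real code):

```go
package server

import "context"

// JobCountByStateRow stands in for the sqlc-generated row type
// (dbsqlc.JobCountByStateRow in the logs); illustrative only.
type JobCountByStateRow struct {
	State string
	Count int64
}

// cacheThreshold approximates the ">1M rows" cutoff described above.
const cacheThreshold = 1_000_000

// countsByState picks between the cached result and a fresh in-band query.
// The three function parameters stand in for the real dependencies: a cheap
// row estimate, the query cache, and the direct sqlc query.
func countsByState(
	ctx context.Context,
	estimateRows func(context.Context) (int64, error),
	cachedCounts func() ([]*JobCountByStateRow, bool),
	queryCounts func(context.Context) ([]*JobCountByStateRow, error),
) ([]*JobCountByStateRow, error) {
	estimate, err := estimateRows(ctx)
	if err != nil {
		return nil, err
	}

	if estimate > cacheThreshold {
		if counts, ok := cachedCounts(); ok {
			return counts, nil // ~instant: serve the background-cached result
		}
		// No cached result yet (e.g. immediately after startup): fall through
		// and run the query directly, accepting that it may be slow.
	}

	// Smaller database, or cache not warmed yet: run the query in-band so the
	// counts look as fresh and real-time as possible.
	return queryCounts(ctx)
}
```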

To support this, I've had to make some changes to the River UI API
server/handler so that it has a Start function that can be invoked to
start background utilities like the query cache. It's a considerable
change, but I think it leaves us in a more sustainable place API-wise
because we may want to add other background utilities later on, and
returning an http.Handler isn't enough because even if you were to
start goroutines from NewHandler, it's far from ideal that there's no
way to stop those goroutines again (problematic for anything that wants
to check for leaks with goleak).
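
Hedged sketch of the intended lifecycle from an embedding program's point of view; the `NewServer`/`ServerOpts`/`Handler` names are assumptions based on the paragraph above rather than a documented API:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/riverqueue/riverui"
)

func main() {
	// Cancelling this context is what stops the background goroutines again,
	// which is what makes leak checking with goleak workable in tests.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Constructor, options, and the Handler accessor are assumptions based on
	// the description above, not a documented API.
	server, err := riverui.NewServer(&riverui.ServerOpts{ /* DB pool, logger, ... */ })
	if err != nil {
		log.Fatal(err)
	}

	// Start launches background utilities like the query cache; it does not
	// itself start listening for HTTP.
	if err := server.Start(ctx); err != nil {
		log.Fatal(err)
	}

	log.Fatal(http.ListenAndServe(":8080", server.Handler()))
}
```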

I'm also going to propose that we increase the default API endpoint
timeout from 5 seconds to 10 seconds. When I load in 3 to 5 million job
rows, I see count queries taking right around that same 3 to 5 second range.
Since the original number of 5 seconds was a little arbitrary anyway, it
can't hurt to give those queries a little more leeway. A problem that
could still occur even with my proposal here is that if a user starts
River UI and then immediately hits the UI, there won't be a cached
result yet, and therefore the count query will go to the database
directly, and that may still cause a timeout at 5 seconds.
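
For reference, one way a 10-second cap could be enforced with just the standard library (the real project may do this differently, e.g. with a per-request context deadline rather than `http.TimeoutHandler`):

```go
package server

import (
	"net/http"
	"time"
)

// withTimeout caps how long an endpoint may run. The 10s value mirrors the
// proposed new default; how the real project enforces the timeout may differ.
func withTimeout(h http.Handler) http.Handler {
	return http.TimeoutHandler(h, 10*time.Second, "request timed out")
}
```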

I've only applied caching to the count query so far, but I've written
the QueryCacher code such that it can cleanly support other queries if
we care to add them.
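
Continuing the hypothetical `querycache` sketch above (same package, same imports), wiring up a second cached query would just be another instantiation; `QueueCountRow` and the query function are invented stand-ins, not part of riverui:

```go
// QueueCountRow is a made-up row type for some other slow aggregate.
type QueueCountRow struct {
	Queue string
	Count int64
}

// startQueueCountCache shows a second use of the same generic machinery.
func startQueueCountCache(ctx context.Context, logger *slog.Logger,
	queryQueueCounts func(context.Context) ([]*QueueCountRow, error),
) *QueryCacher[[]*QueueCountRow] {
	cache := New(logger, queryQueueCounts)
	go cache.Start(ctx, 10*time.Second) // runs until ctx is cancelled
	return cache
}
```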

@brandur force-pushed the brandur-query-cache branch from 7b11868 to 6baf4d6 on August 7, 2024 00:43
@brandur requested a review from bgentry on August 7, 2024 00:43
@brandur (Collaborator, Author) commented Aug 7, 2024

@bgentry Wanted to get your rough thoughts on this approach. Here's what the service looks like in action:

$ go build ./cmd/riverui && RIVER_DEBUG=true ./riverui
2024/08/06 17:43:13 starting server on :8080
time=2024-08-06T17:43:24.306-07:00 level=DEBUG msg="QueryCacher[[]*dbsqlc.JobCountByStateRow]: Ran query and cached result" duration=38.686833ms tick_period=10.298205226s
time=2024-08-06T17:43:34.566-07:00 level=DEBUG msg="QueryCacher[[]*dbsqlc.JobCountByStateRow]: Ran query and cached result" duration=888.291µs tick_period=10.298205226s
time=2024-08-06T17:43:44.864-07:00 level=DEBUG msg="QueryCacher[[]*dbsqlc.JobCountByStateRow]: Ran query and cached result" duration=806µs tick_period=10.298205226s

@bgentry (Contributor) left a comment

Nice! Looking forward to giving this a shot.

The demo app is currently just below the threshold that would trigger this, wondering if we should nudge up its job frequency so that it's above the threshold 🤔

go.mod (Outdated)

@@ -1,6 +1,6 @@
 module github.com/riverqueue/riverui

-go 1.22
+go 1.22.5

@bgentry (Contributor):

need to make tweaks to this based on what we've just learned, right?

@brandur (Collaborator, Author):

Changed it to 1.22. We have to keep this one on 1.22+ though because we're using Go 1.22's new HTTP routing features.
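
(For context, the Go 1.22 routing features in question are the method-aware, wildcard-aware `ServeMux` patterns; the route below is purely illustrative, not a real River UI route.)

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Go 1.22's ServeMux accepts an HTTP method prefix and path wildcards in
	// patterns, and r.PathValue reads the matched wildcard value.
	mux.HandleFunc("GET /api/jobs/{id}", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "job %s\n", r.PathValue("id"))
	})
	_ = http.ListenAndServe(":8080", mux)
}
```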

// Start starts the server's background services. Notably, this does _not_ cause
// the server to start listening for HTTP in any way. To do that, call Handler
// and mount or run it using Go's built in `net/http`.
func (s *Server) Start(ctx context.Context) error {
@bgentry (Contributor):

I probably need some structure like this in the API & demo apps too 🤔

Comment on lines -2 to -11
SELECT
state,
count(*)
FROM
river_job
WHERE
queue IS NOT NULL AND
priority > 0 AND
scheduled_at IS NOT NULL AND
id IS NOT NULL
@bgentry (Contributor):

I think this was originally in here in an attempt to nudge Postgres into taking advantage of the prioritized fetching index for this query, but tbh it may not have been the best way to do that. I assume you've ensured the index still gets used on large tables even after this change?

@brandur (Collaborator, Author) commented Aug 13, 2024:

Oh no, the index definitely does not get used (unless you reindex), and that's what's driving this whole change. A seq scan was reported in #106, and it turns out that the seq scan is mostly right because an index scan is approximately equivalent in cost, so Postgres' decision is correct. The NOT NULL conditions are already in the released version of the query, and the seq scan happens either way.
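
For anyone who wants to double-check the planner's choice on a large table, a small sketch of running `EXPLAIN` from Go via pgx (the SQL text is only an approximation of the count-by-state query, not copied from the repository):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/jackc/pgx/v5"
)

// Prints the plan Postgres picks for an approximate count-by-state query so
// the seq scan vs. index scan decision can be inspected directly.
func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	defer conn.Close(ctx)

	rows, err := conn.Query(ctx, "EXPLAIN SELECT state, count(*) FROM river_job GROUP BY state")
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```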

return nil
}

// Simplies the name of a Go type that uses generics for cleaner logging output.
@bgentry (Contributor):

Suggested change:
-// Simplies the name of a Go type that uses generics for cleaner logging output.
+// Simplifies the name of a Go type that uses generics for cleaner logging output.

@brandur (Collaborator, Author):

thx.

@brandur force-pushed the brandur-query-cache branch 5 times, most recently from c00c3cf to aac7a3e on August 13, 2024 02:59
@brandur (Collaborator, Author) commented Aug 13, 2024

> The demo app is currently just below the threshold that would trigger this, wondering if we should nudge up its job frequency so that it's above the threshold 🤔

Yes! Totally down for this. It'd probably help suss out other performance problems too. I guess if that's hard we could also consider reducing the 1M threshold to 750k since it's somewhat arbitrary anyway.

@brandur force-pushed the brandur-query-cache branch from aac7a3e to 82b4b45 on August 13, 2024 03:07
@brandur (Collaborator, Author) commented Aug 13, 2024

Thanks! Addressed comments and also added a changelog.

@brandur merged commit 7088258 into master on Aug 13, 2024 · 10 checks passed
@brandur deleted the brandur-query-cache branch on August 13, 2024 03:10