Test DB pool fixes #769
Conversation
```diff
  // greater. If changing this number, also change the similar value in
  // `riverinternaltest` where it's duplicated.
- dbNames := generateTestDBNames(max(4, runtime.NumCPU()))
+ dbNames := generateTestDBNames(runtime.GOMAXPROCS(0))
```
I'm not sure we actually want to rely on this setting since it could be removed someday. However, it's at least somewhat dynamic and can be overridden, whereas `runtime.NumCPU()` is fixed. My thought was that this option is just slightly more reliable if somebody tweaks `GOMAXPROCS` and still wants things to work here.

Not set on it though.
Force-pushed from b55b147 to 44d1515.
```go
// Use a smaller number of conns per pool, because otherwise we could have
// NUM_CPU pools, each with NUM_CPU connections, and that's a lot of
// connections if there are many CPUs.
```
Actually realized this issue because in some of my tests I was hitting the 100 conn limit on my local Postgres, at least once I stopped closing every pool after every test. 14 cores * 14 conns per pool = 196 potential conns.
I don't think any given test ever needs to use more than 4 in the pool.
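Roughly what the cap looks like with pgxpool; the helper name and DSN here are illustrative, not the actual code in `riverinternaltest`:

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newTestDBPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}

	// pgxpool's default MaxConns is the greater of 4 and runtime.NumCPU(),
	// so NUM_CPU test DB pools could each open NUM_CPU connections. A fixed
	// cap of 4 keeps the total comfortably under Postgres' default
	// max_connections of 100.
	config.MaxConns = 4

	return pgxpool.NewWithConfig(ctx, config)
}

func main() {
	// Hypothetical test database DSN for illustration only.
	pool, err := newTestDBPool(context.Background(), "postgres://localhost:5432/river_test_0")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}
```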
Force-pushed from 44d1515 to c1f5b53.
Noice!
Unfortunately, I don't think this is a total fix; I've still been seeing occasional errors (though less frequently). Still, I think each of these changes is good, so it gets us closer. Will keep chasing down the remaining cause.
I'm not entirely sure _why_ this is necessary on GitHub Actions when it's not at all needed locally. What we continue to see in CI is that sometimes tests time out on trying to acquire a test DB from the pool of test DBs, which is currently capped at GOMAXPROCS count. This _shouldn't_ happen, because that's the same as the number of parallel tests that should be able to run at once, and so we shouldn't have more than that number of DBs in demand from concurrent tests. My hunch is that something is stalling the `TRUNCATE` cleanup actions and causing one or more tests to be blocked on a DB that's been released but not yet returned to the pool because it's undergoing cleanup. Simply increasing the number of DBs to provide a little padding appears to completely eliminate the staging CI flakiness, _and_ it further speeds up the `river` package's tests to 20-24 seconds, rather than the 31 seconds we've been seeing since #769 was merged, or the ~40 seconds we were seeing prior to that.
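The padding amounts to something like the following (the `generateTestDBNames` call comes from the diff earlier in the thread; the exact padding value here is illustrative):

```go
// Provision a few spare databases beyond GOMAXPROCS so a test isn't starved
// waiting on a DB that's been released but is still mid-TRUNCATE cleanup.
// The +4 padding is illustrative, not the exact value used.
dbNames := generateTestDBNames(runtime.GOMAXPROCS(0) + 4)
```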
In recent changes from #769 / fb6b3e9, I inadvertently tweaked the logic of the `testdb.Manager` to throw away / destroy a resource if it encountered cleanup errors. Destroying the resource has the side effect of causing the manager to try to reallocate a new database with a new number, potentially higher than the number of test databases we've created. Instead, try to just make a new pool on the same database, and only destroy the resource if that fails.
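A sketch of that fallback, assuming a puddle-backed manager; the `DBWithPool` fields, `Manager` struct, and method name here are stand-ins, not the actual `testdb` internals:

```go
package testdb

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/jackc/puddle/v2"
)

// DBWithPool is a hypothetical stand-in for the manager's pooled resource
// value: a named test database plus its open connection pool.
type DBWithPool struct {
	dbName string
	pool   *pgxpool.Pool
}

// Manager is a minimal stand-in for testdb.Manager.
type Manager struct {
	baseURL string // e.g. "postgres://localhost:5432/" (assumption)
}

// releaseAfterCleanup decides what to do with a resource whose cleanup
// failed. Instead of destroying it (which would make puddle allocate a new,
// possibly out-of-range database number), first try to rebuild a pool against
// the same database, and only destroy if that also fails.
func (m *Manager) releaseAfterCleanup(ctx context.Context, res *puddle.Resource[*DBWithPool], cleanupErr error) {
	if cleanupErr == nil {
		res.Release()
		return
	}

	newPool, err := pgxpool.New(ctx, m.baseURL+res.Value().dbName)
	if err != nil {
		res.Destroy() // last resort: give up on this database entirely
		return
	}

	res.Value().pool.Close()
	res.Value().pool = newPool
	res.Release()
}
```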
Tested the heck out of this on #763 until I figured out a few issues:

- Each of the pools for the `testdb.Manager` allowed up to `NUM_CPU` conns to be created, meaning it was actually `NUM_CPU * NUM_CPU` connections. That's on top of the `NUM_CPU` connections being allocated for the global `TestTx` pool. I reduced this to a fixed limit of 4.

The combination of these seems to have caused the flakiness issues that got worse with 1.24. Ultimately we were putting a ton of churn on Postgres conns and creating a lot of them at once. With these changes, not only do the tests pass consistently, but the time for the `river` package on Actions seems to drop from ~40 seconds to ~31 seconds.
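For concreteness, the before/after connection math using the 14-core example from the review comment above (the core count is just that example, not a fixed property of CI):

```go
package main

import "fmt"

func main() {
	const numCPU = 14 // example core count from the review discussion

	// Before: NUM_CPU manager pools, each allowed up to NUM_CPU conns, plus
	// the global TestTx pool's NUM_CPU conns.
	before := numCPU*numCPU + numCPU // 210

	// After: the same pools capped at 4 conns each.
	after := numCPU*4 + numCPU // 70

	// Well over vs. comfortably under Postgres' default max_connections of 100.
	fmt.Println(before, after)
}
```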