Test DB pool fixes #769

Merged
merged 5 commits into master from bg-testdb-pool-fixes on Feb 18, 2025
Conversation


@bgentry bgentry commented Feb 18, 2025

Tested the heck out of this on #763 until I figured out a few issues:

  1. We were allowing NUM_CPU conns to be created in each of the pools for the testdb.Manager, meaning it was actually NUM_CPU * NUM_CPU connections. That's on top of the NUM_CPU connections being allocated for the global TestTx pool. I reduced this to a fixed limit of 4.
  2. We didn't have a minimum set on the pools. There's no reason for any of these pooled test DBs to drop below 1, so I set that as the minimum.
  3. We were always closing pools after every use, which is wasteful. After digging, I found only a single test that intentionally closed its pool (one for the notifier). Rather than always closing the pool after each test, do a ping operation to make sure it's not closed; if it is closed, it can be recreated.

The combination of these issues seems to be what caused the flakiness that got worse with Go 1.24: ultimately we were putting a ton of churn on Postgres conns and creating a lot of them at once. With these changes, not only do the tests pass consistently, but the time for the river package on Actions seems to drop from ~40 seconds to ~31 seconds.
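For illustration, here is a minimal sketch of the pool settings described above using pgxpool; the helper names (`newTestDBPool`, `poolHealthy`) are assumptions for this sketch, not the actual testdb.Manager code:

```go
package testdb

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Values described in the PR: a fixed cap of 4 conns per pooled test DB
// and a minimum of 1 so pools never drop to zero connections.
const (
	maxConnsPerTestDB = 4
	minConnsPerTestDB = 1
)

// newTestDBPool is a hypothetical helper showing the settings described
// above; it is not the actual Manager implementation.
func newTestDBPool(ctx context.Context, databaseURL string) (*pgxpool.Pool, error) {
	config, err := pgxpool.ParseConfig(databaseURL)
	if err != nil {
		return nil, fmt.Errorf("parsing pool config: %w", err)
	}
	config.MaxConns = maxConnsPerTestDB
	config.MinConns = minConnsPerTestDB
	return pgxpool.NewWithConfig(ctx, config)
}

// poolHealthy pings an existing pool instead of closing it after every
// test; only if the ping fails does the caller recreate the pool.
func poolHealthy(ctx context.Context, pool *pgxpool.Pool) bool {
	return pool.Ping(ctx) == nil
}
```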

@bgentry bgentry requested a review from brandur February 18, 2025 04:15
```diff
 // greater. If changing this number, also change the similar value in
 // `riverinternaltest` where it's duplicated.
-dbNames := generateTestDBNames(max(4, runtime.NumCPU()))
+dbNames := generateTestDBNames(runtime.GOMAXPROCS(0))
```
@bgentry bgentry (Contributor, Author) commented:

I'm not sure we actually want to rely on this setting, since it could be removed someday. However, it's at least somewhat dynamic and can be overridden, whereas runtime.NumCPU() is fixed. My thought was that this option is just slightly more reliable if somebody tweaks GOMAXPROCS and still wants things to work here.

Not set on it though.
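As background for the tradeoff described here: passing 0 to runtime.GOMAXPROCS reads the current setting without changing it, and that setting honors the GOMAXPROCS environment variable and later runtime overrides, while runtime.NumCPU() is fixed at the machine's logical CPU count. A small standalone illustration:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU is fixed at process start: the number of logical CPUs.
	fmt.Println("NumCPU:", runtime.NumCPU())

	// Passing 0 (or any n < 1) to GOMAXPROCS reads the current setting
	// without modifying it. It reflects any GOMAXPROCS env override.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// The setting can also be changed at runtime, which NumCPU can't.
	runtime.GOMAXPROCS(2)
	fmt.Println("after override:", runtime.GOMAXPROCS(0))
}
```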

@bgentry bgentry force-pushed the bg-testdb-pool-fixes branch from b55b147 to 44d1515 on February 18, 2025 04:19
Comment on lines +284 to +286
```go
// Use a smaller number of conns per pool, because otherwise we could have
// NUM_CPU pools, each with NUM_CPU connections, and that's a lot of
// connections if there are many CPUs.
```
@bgentry bgentry (Contributor, Author) commented:

I actually realized this issue because in some of my tests I was hitting the 100-conn limit on my local Postgres, at least once I stopped closing every pool after every test: 14 cores * 14 conns per pool = 196 potential conns.

I don't think any given test ever needs more than 4 conns in its pool.
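(For comparison, with the new cap: 14 pools * 4 conns = 56, plus the 14 conns of the global TestTx pool, comes to roughly 70 potential conns, comfortably under Postgres's default max_connections of 100.)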

@bgentry bgentry force-pushed the bg-testdb-pool-fixes branch from 44d1515 to c1f5b53 on February 18, 2025 04:38
@brandur brandur (Contributor) left a comment:

Noice!

@bgentry bgentry (Contributor, Author) commented Feb 18, 2025

Unfortunately, I don't think this is a total fix; I've still been seeing occasional errors, though less frequently. Still, I think each of these changes is good, so it gets us closer. I'll keep chasing down the remaining cause.

@bgentry bgentry merged commit fb6b3e9 into master Feb 18, 2025
10 checks passed
@bgentry bgentry deleted the bg-testdb-pool-fixes branch February 18, 2025 05:26
bgentry added a commit that referenced this pull request Feb 19, 2025
I'm not entirely sure _why_ this is necessary on GitHub Actions when
it's not at all needed locally. What we continue to see in CI is that
sometimes tests time out on trying to acquire a test DB from the pool of
test DBs, which is currently capped at the GOMAXPROCS count. This
_shouldn't_ happen, because that's the same as the number of parallel
tests that should be able to run at once, and so we shouldn't have more
than that number of DBs in demand from concurrent tests.

My hunch is that something is stalling the `TRUNCATE` cleanup actions
and causing one or more tests to be blocked on a DB that's been released
but not yet returned to the pool because it's undergoing cleanup. Simply
increasing the number of DBs to provide a little padding appears to
completely eliminate the staging CI flakiness, _and_ it further speeds
up the `river` package's tests to 20-24 seconds, rather than the 31
seconds we've been seeing since #769 was merged, or the ~40 seconds we
were seeing prior to that.
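A sketch of the kind of padding described; the commit text doesn't state the exact amount, so the +2 below is an assumed value:

```go
// Illustrative only: pad the pooled test DB count slightly beyond
// GOMAXPROCS so a test isn't starved while a released DB is still
// mid-TRUNCATE cleanup. The actual padding amount is assumed here.
dbNames := generateTestDBNames(runtime.GOMAXPROCS(0) + 2)
```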
bgentry added a commit that referenced this pull request Feb 19, 2025
In recent changes from #769 / fb6b3e9, I inadvertently tweaked the logic
of the testdb.Manager to throw away / destroy a resource if it
encountered cleanup errors. Destroying the resource has the side effect
of causing the manager to try to reallocate a new database with a new
number, potentially higher than the number of test databases we've
created.

Instead, try to just make a new pool on the same database, and only
destroy the resource if that fails.
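A minimal sketch of the corrected flow; dbResource, Manager, and the method names here are stand-ins for illustration, not the real testdb API:

```go
package testdb

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// dbResource and Manager are hypothetical stand-ins for the real testdb
// types, with just enough structure to show the control flow.
type dbResource struct {
	databaseURL string
	pool        *pgxpool.Pool
}

type Manager struct {
	resources chan *dbResource // free list of pooled test DBs
}

func (m *Manager) returnToPool(res *dbResource) { m.resources <- res }

func (m *Manager) destroy(res *dbResource) {
	if res.pool != nil {
		res.pool.Close()
	}
	// The real manager would now allocate a replacement database,
	// potentially with a new, higher number.
}

// release returns a test DB after a test. If cleanup failed, first try
// building a fresh pool on the same database; only destroy the resource
// (forcing reallocation) if that also fails.
func (m *Manager) release(ctx context.Context, res *dbResource, cleanupErr error) {
	if cleanupErr == nil {
		m.returnToPool(res)
		return
	}
	newPool, err := pgxpool.New(ctx, res.databaseURL)
	if err != nil {
		m.destroy(res)
		return
	}
	res.pool.Close()
	res.pool = newPool
	m.returnToPool(res)
}
```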