Test DB pool fixes #769
Conversation
```diff
  // greater. If changing this number, also change the similar value in
  // `riverinternaltest` where it's duplicated.
- dbNames := generateTestDBNames(max(4, runtime.NumCPU()))
+ dbNames := generateTestDBNames(runtime.GOMAXPROCS(0))
```
I'm not sure we actually want to rely on this setting since it could be removed someday. However, it's at least somewhat dynamic and can be overridden, whereas `runtime.NumCPU()` is fixed. My thought was that this option is just slightly more reliable if somebody tweaks `GOMAXPROCS` and still wants things to work here.

Not set on it though.
Force-pushed from b55b147 to 44d1515.
```go
// Use a smaller number of conns per pool, because otherwise we could have
// NUM_CPU pools, each with NUM_CPU connections, and that's a lot of
// connections if there are many CPUs.
```
Actually realized this issue because in some of my tests I was hitting the 100 conn limit on my local Postgres, at least once I stopped closing every pool after every test. 14 cores * 14 conns per pool = 196 potential conns.
I don't think any given test ever needs to use more than 4 in the pool.
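Roughly what the cap looks like with pgxpool; the helper name and DSN here are illustrative, not the actual code in `riverinternaltest`:

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newTestDBPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}

	// pgxpool's default MaxConns is the greater of 4 and runtime.NumCPU(),
	// so NUM_CPU test DB pools could each open NUM_CPU connections. A fixed
	// cap of 4 keeps the total comfortably under Postgres' default
	// max_connections of 100.
	config.MaxConns = 4

	return pgxpool.NewWithConfig(ctx, config)
}

func main() {
	// Hypothetical test database DSN for illustration only.
	pool, err := newTestDBPool(context.Background(), "postgres://localhost:5432/river_test_0")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}
```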
Force-pushed from 44d1515 to c1f5b53.
Noice!
Unfortunately, I don't think this is a total fix; I've still been seeing occasional errors (though less frequently). Still, I think each of these changes is good, so it gets us closer. Will keep chasing down the remaining cause.
I'm not entirely sure _why_ this is necessary on GitHub Actions when it's not at all needed locally. What we continue to see in CI is that sometimes tests time out on trying to acquire a test DB from the pool of test DBs, which is currently capped at GOMAXPROCS count. This _shouldn't_ happen, because that's the same as the number of parallel tests that should be able to run at once, and so we shouldn't have more than that number of DBs in demand from concurrent tests. My hunch is that something is stalling the `TRUNCATE` cleanup actions and causing one or more tests to be blocked on a DB that's been released but not yet returned to the pool because it's undergoing cleanup. Simply increasing the number of DBs to provide a little padding appears to completely eliminate the staging CI flakiness, _and_ it further speeds up the `river` package's tests to 20-24 seconds, rather than the 31 seconds we've been seeing since #769 was merged, or the ~40 seconds we were seeing prior to that.
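The padding amounts to something like the following (the `generateTestDBNames` call comes from the diff earlier in the thread; the exact padding value here is illustrative):

```go
// Provision a few spare databases beyond GOMAXPROCS so a test isn't starved
// waiting on a DB that's been released but is still mid-TRUNCATE cleanup.
// The +4 padding is illustrative, not the exact value used.
dbNames := generateTestDBNames(runtime.GOMAXPROCS(0) + 4)
```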
In recent changes from #769 / fb6b3e9, I inadvertently tweaked the logic of the `testdb.Manager` to throw away / destroy a resource if it encountered cleanup errors. Destroying the resource has the side effect of causing the manager to try to reallocate a new database with a new number, potentially higher than the number of test databases we've created. Instead, try to just make a new pool on the same database, and only destroy the resource if that fails.
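A sketch of that fallback, assuming a puddle-backed manager; the `DBWithPool` fields, `Manager` struct, and method name here are stand-ins, not the actual `testdb` internals:

```go
package testdb

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/jackc/puddle/v2"
)

// DBWithPool is a hypothetical stand-in for the manager's pooled resource
// value: a named test database plus its open connection pool.
type DBWithPool struct {
	dbName string
	pool   *pgxpool.Pool
}

// Manager is a minimal stand-in for testdb.Manager.
type Manager struct {
	baseURL string // e.g. "postgres://localhost:5432/" (assumption)
}

// releaseAfterCleanup decides what to do with a resource whose cleanup
// failed. Instead of destroying it (which would make puddle allocate a new,
// possibly out-of-range database number), first try to rebuild a pool against
// the same database, and only destroy if that also fails.
func (m *Manager) releaseAfterCleanup(ctx context.Context, res *puddle.Resource[*DBWithPool], cleanupErr error) {
	if cleanupErr == nil {
		res.Release()
		return
	}

	newPool, err := pgxpool.New(ctx, m.baseURL+res.Value().dbName)
	if err != nil {
		res.Destroy() // last resort: give up on this database entirely
		return
	}

	res.Value().pool.Close()
	res.Value().pool = newPool
	res.Release()
}
```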
Tested the heck out of this on #763 until I figured out a few issues:

- Each of the pools for the `testdb.Manager` allowed up to `NUM_CPU` conns to be created, meaning it was actually `NUM_CPU * NUM_CPU` connections. That's on top of the `NUM_CPU` connections being allocated for the global `TestTx` pool. I reduced this to a fixed limit of 4.

The combination of these seems to have caused the flakiness issues that got worse with 1.24. Ultimately we were putting a ton of churn on Postgres conns and creating a lot of them at once. With these changes, not only do the tests pass consistently, but the time for the `river` package on Actions seems to drop from ~40 seconds to ~31 seconds.
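For concreteness, the before/after connection math using the 14-core example from the review comment above (the core count is just that example, not a fixed property of CI):

```go
package main

import "fmt"

func main() {
	const numCPU = 14 // example core count from the review discussion

	// Before: NUM_CPU manager pools, each allowed up to NUM_CPU conns, plus
	// the global TestTx pool's NUM_CPU conns.
	before := numCPU*numCPU + numCPU // 210

	// After: the same pools capped at 4 conns each.
	after := numCPU*4 + numCPU // 70

	// Well over vs. comfortably under Postgres' default max_connections of 100.
	fmt.Println(before, after)
}
```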