Occasional "context cancelled" dial errors even though we don't use cancellable contexts #522
Comments
I also realise that our use of pgx and pgxpool might be adding complexity here, or even that the cancelled contexts come from them, since the pooling logic has defaults for how long connections are kept alive. Still, my understanding is that using pgxpool is tested and supported, as it's shown in the README, so we shouldn't be running into issues.
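For reference, the pool lifetimes in question are config fields on pgx/v5's pgxpool; a minimal sketch is below (the values are purely illustrative, not the library's defaults, which pgxpool applies when the fields are left unset):

```go
package poolconfig

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool shows the pool-level lifetime knobs mentioned above.
func newPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.MaxConnLifetime = 30 * time.Minute // recycle connections periodically
	cfg.MaxConnIdleTime = 5 * time.Minute  // drop connections that sit idle too long
	cfg.HealthCheckPeriod = time.Minute    // how often the pool checks idle connections
	return pgxpool.NewWithConfig(ctx, cfg)
}
```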
Thanks for the issue @mvdan. It's possible this is related to how Cloud Run will throttle the CPU when Cloud Run isn't serving requests. Let me take a closer look here to see if we are using our contexts properly and report back.
Thanks! Some more bits of info:
To add another data point: even though we have "CPU is only allocated during request processing" enabled, and the service sometimes goes hours without any requests, all SQL queries and statements run only on goroutines that handle HTTP requests, i.e. the call stack sits on top of ServeHTTP. Perhaps the connection pool struggled to keep connections alive while the CPU was starved before handling the request, though.
My best guess without having done any debugging yet is that the CPU throttling is affecting the background certificate refresh that this Go Connector runs. FWIW we're working on a lazy refresh option so that there are no goroutines running in the background. In other words, when a client (or pool) requests a connection, the Go Connector retrieves the certificate right then and there. This would presumably help the Cloud Run use case (where, in reality, background goroutines do not run reliably). You might try CPU always allocated, but for a low traffic service, the cost probably isn't worth the change.
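For illustration only, if the lazy refresh ends up being exposed as a dialer option, wiring it in might look roughly like the sketch below. The option name `WithLazyRefresh` is an assumption here, not a released API at the time of this thread:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/alloydbconn"
)

func main() {
	ctx := context.Background()
	// Assumed option name: with lazy refresh, the certificate would be fetched
	// on demand when Dial is called, rather than by a background goroutine
	// that Cloud Run's CPU throttling can starve.
	d, err := alloydbconn.NewDialer(ctx, alloydbconn.WithLazyRefresh())
	if err != nil {
		log.Fatal(err)
	}
	defer d.Close()
	// Wire d.Dial into the pool's DialFunc as usual.
}
```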
That might actually make sense. The errors only happen in the middle of the night, when our daily e2e test run happens, likely hours after any other traffic. My repeated attempts at reproducing the flake have failed, but I always tried in the middle of the day, without really letting the service stay idle for long. Let me know when you have a commit for us to try, and we'll report back. We've only seen two failures in three weeks, though, so we might realistically need a whole month before we can convince ourselves that the error is gone for good.
Sounds good. We're going to be working on a lazy refresh across the entire suite of connectors, so it might be a while before we have something to try. I'll update here when we have a commit ready.
Note that we switched our Cloud Run service to "CPU is always allocated" and we are still seeing this error - in fact more often now, multiple times a day, as our traffic is starting to ramp up 😬 It even happened just now as we had deployed a new revision of the service, meaning that the process had only been alive for a few minutes. Has there been any decent testing of alloydb-go-connector being used with pgxpool on top, like I describe in the snippet above, following the README? I'm starting to worry that it's the pooling of connections, or the interaction between pgx or pgxpool with alloydb-go-connector, which is the problem. I don't think CPU allocation is related at all, given the paragraph above.
Hmm, I owe you an apology - this was all a pretty significant blunder on my part :) So, effectively, we were dialing with a dialer that had already been closed: the dialer was created and deferred-closed inside our setup function, so every connection the pool opened later went through a closed dialer. I lowered the pgxpool config lifetimes, including the idle time for connections, and the error happened almost instantly, which is how I narrowed it down. After the fix (not closing the dialer too early), everything seems to work normally. This was 100% a bug in our code, but could I suggest that you make the dialer return a clear error when it has already been closed, rather than a confusing "context cancelled"?
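To make the blunder concrete, the shape of the bug was roughly the following. This is a reconstruction under the assumptions above, not our exact code:

```go
package service

import (
	"context"
	"net"

	"cloud.google.com/go/alloydbconn"
	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool reconstructs the mistake: the dialer is closed too early.
func newPool(ctx context.Context, dsn, instanceURI string) (*pgxpool.Pool, error) {
	d, err := alloydbconn.NewDialer(ctx)
	if err != nil {
		return nil, err
	}
	// BUG: this defer closes the dialer as soon as newPool returns, while the
	// pool keeps calling d.Dial for every new connection afterwards. The fix
	// is to close the dialer only at application shutdown, after the pool.
	defer d.Close()

	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.ConnConfig.DialFunc = func(ctx context.Context, _, _ string) (net.Conn, error) {
		return d.Dial(ctx, instanceURI)
	}
	return pgxpool.NewWithConfig(ctx, cfg)
}
```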
I'm so glad you found this. I was feeling very nervous about a lurking bug otherwise. I think we can do two things to close this issue: have the dialer return a clear error when it is used after being closed, and revisit the example code so the dialer's lifetime (and its deferred Close) is harder to get wrong.
Thank you for all the info here. I'm still curious whether you'll see occasional errors when CPU is not always allocated, so if you do, feel free to report those. We have lazy refresh planned for the next quarter and will be addressing it across all our connectors (Cloud SQL and AlloyDB, across Java, Python, Go, and Node.js).
Those two steps seem reasonable to me; happy to code review those patches as well, given my recent experience. You're right that I copy pasted from your example, and I completely missed one of the defers. The fact that the code seemed to work during initial testing (and in fact it worked for up to half an hour, given the defaults) made me think it was right. We had switched away from allocating CPU only during request processing to see if that would help with the "idle" errors, but since it didn't, and the bug was this line of code instead, I'll move us back to on-demand CPU again. I assume we won't see any more errors :)
Sounds good. I have a few high priority items to get to first and will get a PR up for this soon. |
If the dialer has already been closed, return a clear error. Fixes #522
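In spirit, the fix amounts to a guard along the lines of the sketch below. This is a simplified illustration of the idea, not the connector's actual implementation, and the sentinel error name is made up here:

```go
package dialersketch

import (
	"context"
	"errors"
	"net"
	"sync"
)

// ErrDialerClosed is a hypothetical sentinel error a dialer could return
// once Close has been called; the real connector's error may differ.
var ErrDialerClosed = errors.New("dialer: already closed")

type Dialer struct {
	mu     sync.Mutex
	closed bool
}

func (d *Dialer) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.closed = true
	return nil
}

// Dial refuses to proceed once the dialer has been closed, so callers see a
// descriptive error instead of a confusing "context cancelled".
func (d *Dialer) Dial(ctx context.Context, instance string) (net.Conn, error) {
	d.mu.Lock()
	closed := d.closed
	d.mu.Unlock()
	if closed {
		return nil, ErrDialerClosed
	}
	// ... perform the real dial here ...
	return nil, errors.New("dial not implemented in this sketch")
}
```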
Bug Description
We use an AlloyDB cluster from a Go service in Cloud Run, and it mostly works fine, except that we occasionally get dial errors complaining that the context was cancelled. This doesn't make sense, given that the database connection pool (https://pkg.go.dev/github.com/jackc/pgx/v5/pgxpool) is opened with a background context, the SQL statement is executed with a context.TODO(), and neither is cancellable.
The Go program in Cloud Run is connected to AlloyDB via a VPC connector, as outlined in https://cloud.google.com/alloydb/docs/quickstart/integrate-cloud-run.
We use the following versions:
We set up pgxpool with the alloydb connector as best we could, looking at the examples in your documentation. The relevant bits of the code are below; hopefully we didn't make any mistakes.
Example code (or command)
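The actual snippet isn't reproduced here, but the setup follows the pgxpool integration shown in the README. The sketch below illustrates that pattern with placeholder connection details (DSN, project, cluster, and instance names are all made up), and also shows the background/TODO contexts mentioned above; it is not the exact service code:

```go
package main

import (
	"context"
	"log"
	"net"

	"cloud.google.com/go/alloydbconn"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// Dialer for AlloyDB; in this pattern it must stay open for the lifetime
	// of the pool and is only closed at program shutdown.
	d, err := alloydbconn.NewDialer(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer d.Close()

	// Placeholder DSN and instance URI.
	dsn := "user=postgres password=secret dbname=postgres sslmode=disable"
	instanceURI := "projects/my-project/locations/us-central1/clusters/my-cluster/instances/my-instance"

	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		log.Fatal(err)
	}
	// Route every new pool connection through the AlloyDB dialer.
	cfg.ConnConfig.DialFunc = func(ctx context.Context, _, _ string) (net.Conn, error) {
		return d.Dial(ctx, instanceURI)
	}

	// The pool is opened with a background context and queries use
	// context.TODO(); neither is cancellable.
	pool, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	if _, err := pool.Exec(context.TODO(), "SELECT 1"); err != nil {
		log.Fatal(err)
	}
}
```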
Stacktrace
No response
Steps to reproduce?
We aren't able to reliably reproduce this error. It happens occasionally in our live service; we have seen the error twice in the past two weeks.
Environment
Additional Details
The code isn't in a public repo, apologies. I'm more than happy to try to provide any more information that is requested.