
Occasional "context cancelled" dial errors even though we don't use cancellable contexts #522

Closed
mvdan opened this issue Mar 11, 2024 · 14 comments · Fixed by #538 or #543
Labels: priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: cleanup (An internal cleanup or hygiene concern.) · type: docs (Improvement to the documentation for an API.)

Comments

mvdan commented Mar 11, 2024

Bug Description

We use an AlloyDB cluster from a Go service in Cloud Run, and it mostly works fine, except that sometimes we get errors such as:

failed to connect to host=/tmp user=foo@project-bar.iam database=foo: dial error (context canceled)

This doesn't make sense, given that the database connection pool (https://pkg.go.dev/github.com/jackc/pgx/v5/pgxpool) is opened with a background context and the SQL statement is executed with a TODO context; neither is cancellable.

The Go program in Cloud Run is connected to AlloyDB via a VPC connector, as outlined in https://cloud.google.com/alloydb/docs/quickstart/integrate-cloud-run.

We use the following versions:

  • go 1.22.0
  • cloud.google.com/go/alloydbconn v1.7.0
  • github.com/jackc/pgx/v5 v5.5.3

We set up pgxpool with the alloydb connector as best we could, looking at the examples in your documentation. The relevant bits of the code are below; hopefully we didn't make any mistakes.

Example code (or command)

// opening the database

ctx := context.Background()
config, err := pgxpool.ParseConfig(dbURL)
if err != nil {
	return nil, err
}
d, err := alloydbconn.NewDialer(ctx, alloydbconn.WithIAMAuthN())
if err != nil {
	log.Fatalf("failed to initialize dialer: %v", err)
}
defer d.Close()
config.ConnConfig.DialFunc = func(ctx context.Context, _ string, instance string) (net.Conn, error) {
	return d.Dial(ctx, cfg.AlloyDBInstance)
}
pool, err := pgxpool.NewWithConfig(ctx, config)
if err != nil {
	return nil, err
}
srv.db = pool

// using the db pool to run the statement below, which has only failed occasionally

ctx := context.TODO()
_, err := srv.db.Exec(ctx, `
	UPDATE repos_cache
	SET updated = '2000-01-01 00:00+00'::timestamptz,
		updated_collaborators = '2000-01-01 00:00+00'::timestamptz
	WHERE provider = $1
	AND owner_name = $2
	AND repo_name = $3
`, repo.provider, repo.ownerName, repo.repoName)
if err != nil {
	return err
}

Stacktrace

No response

Steps to reproduce?

We aren't able to reliably reproduce this error. It happens occasionally in our live service; we have seen the error twice in the past two weeks.

Environment

  1. OS type and version: Linux on Cloud Run, unclear what version it is
  2. Go version: 1.22.0
  3. AlloyDB Go Connector version: v1.7.0

Additional Details

The code isn't in a public repo, apologies. I'm more than happy to try to provide any more information that is requested.

@mvdan mvdan added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Mar 11, 2024
mvdan (Author) commented Mar 11, 2024

I also realise that our use of pgx and pgxpool might be adding complexity here, or even that the cancelled contexts come from them, since the pooling logic has defaults for how long connections are kept alive. Still, my understanding is that using pgxpool is tested and supported, as it's shown in the README, so we shouldn't be running into issues.

@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Mar 11, 2024
enocom (Member) commented Mar 11, 2024

Thanks for the issue @mvdan. It's possible this is related to how Cloud Run throttles the CPU when it isn't serving requests. Let me take a closer look to see if we are using our contexts properly and report back.

mvdan (Author) commented Mar 11, 2024

Thanks! Some more bits of info:

  • The Cloud Run service is configured with 1 vCPU, 2GiB RAM, "CPU is only allocated during request processing", second generation, minimum 1 and maximum 2 instances, startup CPU boost.
  • The VPC connector is an e2-micro with 2-4 instances
  • The AlloyDB cluster has a single high-availability instance, 2 vCPU, 16 GiB RAM
  • The traffic is very low; these network flakes happen overnight as we run the daily end-to-end test jobs. The CPU and RAM usage is overall at or below 10%.
  • This is all within the same europe-west1 region, so the network roundtrips should be quick and the reliability should be high.

mvdan (Author) commented Mar 12, 2024

It's possible this is related to how Cloud Run will throttle the CPU when Cloud Run isn't serving requests.

To add another data point: even though we have "CPU is only allocated during request processing" enabled, and the service sometimes goes hours without any requests, all SQL queries and statements are only done on goroutines which handle HTTP requests, i.e. the call stack is on top of ServeHTTP. Perhaps the connection pool struggled to keep connections alive while the CPU was starved before handling the request, though.

enocom (Member) commented Mar 13, 2024

My best guess without having done any debugging yet is that the CPU throttling is affecting the background certificate refresh that this Go Connector runs.

FWIW we're working on a lazy refresh option so that there are no goroutines running in the background. In other words, when a client (or pool) requests a connection, the Go Connector retrieves the certificate right then and there. This would presumably help the Cloud Run use case (where in reality background goroutines do not run reliably).

You might try CPU always allocated, but for a low traffic service, the cost probably isn't worth the change.
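For illustration, opting in to lazy refresh might look roughly like the sketch below once the feature lands; the WithLazyRefresh option name is an assumption here, not a released API.

// Hypothetical sketch: a dialer that refreshes certificates lazily on
// Dial instead of in a background goroutine. WithLazyRefresh is an
// assumed option name; the shipped API may differ.
d, err := alloydbconn.NewDialer(
	ctx,
	alloydbconn.WithIAMAuthN(),
	alloydbconn.WithLazyRefresh(),
)
if err != nil {
	log.Fatalf("failed to initialize dialer: %v", err)
}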

mvdan (Author) commented Mar 13, 2024

That might actually make sense. The errors only happen in the middle of the night as our daily e2e test run happens, likely hours after any other traffic. My repeated attempts at reproducing the flake have failed, but I always did that in the middle of the day, not really letting the service stay idle for long.

Let me know when you have a commit for us to try, and we'll report back. We've only seen two failures in three weeks, though, so we might realistically need a whole month before we can convince ourselves that the error is gone for good.

enocom (Member) commented Mar 18, 2024

Sounds good. We're going to be working on a lazy refresh across the entire suite of connectors, so it might be a while before we have something to try. I'll update here when we have a commit ready.

enocom (Member) commented Mar 18, 2024

cc @ttosta-google

mvdan (Author) commented Mar 22, 2024

Note that we switched our Cloud Run service to "CPU is always allocated" and we are still seeing this error; in fact we see it more often now, multiple times a day, as our traffic is starting to ramp up 😬 It even happened just now, right after we had deployed a new revision of the service, meaning that the process had only been alive for a few minutes.

Has there been any decent testing of alloydb-go-connector used with pgxpool on top, as I describe in the snippet above, following the README? I'm starting to worry that the problem is the pooling of connections, or the interaction of pgx/pgxpool with alloydb-go-connector. I don't think CPU allocation is related at all, given the paragraph above.

enocom (Member) commented Mar 25, 2024

Thanks, @mvdan. We do run our test suite using pgxpool (see here) and haven't seen any regular test failures with context cancelled. Nonetheless, I would expect "CPU is always allocated" to make this problem go away.

Let me take a closer look and get back to you.

mvdan (Author) commented Mar 25, 2024

Hmm, I owe you an apology - this was all a pretty significant blunder on my part :) Note defer d.Close(), but a lack of defer pool.Close(). The database setup code was all in an "open database" method, and then the SQL statements with pgxpool were run from goroutines handling HTTP requests, long after the "open database" method returned.

So, effectively, we were using an open pgxpool.Pool with a closed alloydb.Dialer. The errors appeared to happen when pgxpool decided a connection was idle (after 30m), and when trying to open a new connection with the dialer, it would fail with the confusing error (not surprisingly).

I lowered the pgxpool config lifetimes, including the idle time for connections, and the error happened almost instantly, which is how I narrowed it down. After the fix (not closing the dialer too early), everything seems to work normally.

This was 100% a bug in our code, but could I suggest that you make Dialer.Dial fail with a very clear error if Dialer.Close was already called? Even an error like "dialer was already closed" would have saved us multiple hours of debugging, as the "context canceled" error is pretty generic in Go and made me think of all sorts of potential causes.
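For anyone hitting the same thing, here's a minimal sketch of one way to structure this so the dialer outlives the pool; the function shape and names (openDB, cleanup) are illustrative, not our exact code, and it assumes the same imports as the snippet above (context, net, alloydbconn, pgxpool).

// Sketch: return the pool plus a cleanup func instead of deferring
// d.Close() inside the open function, so the dialer stays alive for
// the lifetime of the pool.
func openDB(ctx context.Context, dbURL, instanceURI string) (*pgxpool.Pool, func(), error) {
	config, err := pgxpool.ParseConfig(dbURL)
	if err != nil {
		return nil, nil, err
	}
	d, err := alloydbconn.NewDialer(ctx, alloydbconn.WithIAMAuthN())
	if err != nil {
		return nil, nil, err
	}
	config.ConnConfig.DialFunc = func(ctx context.Context, _, _ string) (net.Conn, error) {
		return d.Dial(ctx, instanceURI)
	}
	pool, err := pgxpool.NewWithConfig(ctx, config)
	if err != nil {
		d.Close()
		return nil, nil, err
	}
	cleanup := func() {
		// Close the pool first so no new dials happen, then the dialer.
		pool.Close()
		d.Close()
	}
	return pool, cleanup, nil
}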

enocom (Member) commented Mar 25, 2024

I'm so glad you found this. I was feeling very nervous about a lurking bug otherwise.

I think we can do two things to close this issue:

  1. Return a better error as you suggest.
  2. Update the sample to make it copy-paste friendly. Right now, a person could reasonably do exactly the same thing you did based on our sample. We can do better here.

Thank you for all the info here. I'm still curious if you'll see occasional errors when CPU is not always allocated. So if you do, feel free to report those. We have lazy refresh planned for the next quarter and will be addressing it across all our connectors (Cloud SQL, AlloyDB across Java, Python, Go, and Node.js).
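To make item 1 concrete, a dialer can track whether Close has been called and return a descriptive error instead of surfacing the canceled internal context; the snippet below is a generic sketch of that pattern, not the connector's actual implementation.

// Generic sketch (not the connector's real code): refuse to dial once
// the dialer has been closed. Assumes imports: context, errors, net, sync.
var errDialerClosed = errors.New("dialer: already closed, cannot dial")

type closableDialer struct {
	mu     sync.Mutex
	closed bool
}

func (d *closableDialer) Dial(ctx context.Context, instance string) (net.Conn, error) {
	d.mu.Lock()
	closed := d.closed
	d.mu.Unlock()
	if closed {
		return nil, errDialerClosed
	}
	// ... perform the real dial here ...
	return nil, errors.New("not implemented in this sketch")
}

func (d *closableDialer) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.closed = true
	return nil
}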

mvdan (Author) commented Mar 25, 2024

Those two steps seem reasonable to me; happy to code review those patches as well, given my recent experience. You're right that I copy-pasted from your example, and I completely missed one of the defers. The fact that the code seemed to work during initial testing (and in fact it worked for up to half an hour given the defaults) made me think it was right.

We had switched away from only allocating CPU to handle incoming requests to see if that would help with the "idle" errors, but since it didn't, and the bug was this line of code instead, I'll move us back to on-demand CPU again. I assume we won't see any more errors :)

enocom (Member) commented Mar 25, 2024

Sounds good. I have a few high priority items to get to first and will get a PR up for this soon.

@enocom enocom added type: cleanup An internal cleanup or hygiene concern. type: docs Improvement to the documentation for an API. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Apr 1, 2024
@enocom enocom assigned enocom and unassigned jackwotherspoon Apr 1, 2024
enocom added commits referencing this issue on Apr 3–9, 2024:

If the dialer has already been closed, return a clear error.

Fixes #522