
Occasional "context cancelled" dial errors even though we don't use cancellable contexts #522

Closed
mvdan opened this issue Mar 11, 2024 · 14 comments · Fixed by #538 or #543
Labels: priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: cleanup (An internal cleanup or hygiene concern.) · type: docs (Improvement to the documentation for an API.)

Comments

mvdan commented Mar 11, 2024

Bug Description

We use an AlloyDB cluster from a Go service in Cloud Run, and it mostly works fine, except that sometimes we get errors such as:

failed to connect to host=/tmp user=foo@project-bar.iam database=foo: dial error (context canceled)

This doesn't make sense, given that the database connection pool (https://pkg.go.dev/github.com/jackc/pgx/v5/pgxpool) is opened with a background context and the SQL statement is executed with a TODO context; neither is cancellable.

The Go program in Cloud Run is connected to AlloyDB via a VPC connector, as outlined in https://cloud.google.com/alloydb/docs/quickstart/integrate-cloud-run.

We use the following versions:

  • go 1.22.0
  • cloud.google.com/go/alloydbconn v1.7.0
  • github.com/jackc/pgx/v5 v5.5.3

We set up pgxpool with the alloydb connector as best we could, looking at the examples in your documentation. The relevant bits of the code are below; hopefully we didn't make any mistakes.

Example code (or command)

// opening the database

ctx := context.Background()
config, err := pgxpool.ParseConfig(dbURL)
if err != nil {
	return nil, err
}
d, err := alloydbconn.NewDialer(ctx, alloydbconn.WithIAMAuthN())
if err != nil {
	log.Fatalf("failed to initialize dialer: %v", err)
}
defer d.Close()
config.ConnConfig.DialFunc = func(ctx context.Context, _ string, instance string) (net.Conn, error) {
	return d.Dial(ctx, cfg.AlloyDBInstance)
}
pool, err := pgxpool.NewWithConfig(ctx, config)
if err != nil {
	return nil, err
}
srv.db = pool

// using the db pool to run the statement below, which has only failed occasionally

ctx := context.TODO()
_, err := srv.db.Exec(ctx, `
	UPDATE repos_cache
	SET updated = '2000-01-01 00:00+00'::timestamptz,
		updated_collaborators = '2000-01-01 00:00+00'::timestamptz
	WHERE provider = $1
	AND owner_name = $2
	AND repo_name = $3
`, repo.provider, repo.ownerName, repo.repoName)
if err != nil {
	return err
}

Stacktrace

No response

Steps to reproduce?

We aren't able to reliably reproduce this error. It happens occasionally in our live service; we have seen the error twice in the past two weeks.

Environment

  1. OS type and version: Linux on Cloud Run, unclear what version it is
  2. Go version: 1.22.0
  3. AlloyDB Go Connector version: v1.7.0

Additional Details

The code isn't in a public repo, apologies. I'm more than happy to try to provide any more information that is requested.

@mvdan mvdan added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Mar 11, 2024
mvdan (Author) commented Mar 11, 2024

I also realise that our use of pgx and pgxpool might be adding complexity here, or even that the cancelled contexts come from them, since the pooling logic has defaults for how long connections are kept alive. Still, my understanding is that using pgxpool is tested and supported, as it's shown in the README, so we shouldn't be running into issues.

@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Mar 11, 2024
enocom (Member) commented Mar 11, 2024

Thanks for the issue @mvdan. It's possible this is related to how Cloud Run throttles the CPU when it isn't serving requests. Let me take a closer look to see if we are using our contexts properly and report back.

mvdan (Author) commented Mar 11, 2024

Thanks! Some more bits of info:

  • The Cloud Run service is configured with 1 vCPU, 2GiB RAM, "CPU is only allocated during request processing", second generation, minimum 1 and maximum 2 instances, startup CPU boost.
  • The VPC connector is an e2-micro with 2-4 instances
  • The AlloyDB cluster has a single high-availability instance, 2 vCPU, 16 GiB RAM
  • The traffic is very low; these network flakes happen overnight as we run the daily end-to-end test jobs. The CPU and RAM usage is overall at or below 10%.
  • This is all within the same europe-west1 region, so the network roundtrips should be quick and the reliability should be high.

mvdan (Author) commented Mar 12, 2024

It's possible this is related to how Cloud Run will throttle the CPU when Cloud Run isn't serving requests.

To add another data point: even though we have "CPU is only allocated during request processing" enabled, and the service sometimes goes hours without any requests, all SQL queries and statements are only done on goroutines which handle HTTP requests, i.e. the call stack is on top of ServeHTTP. Perhaps the connection pool struggled to keep connections alive while the CPU was starved before handling the request, though.

enocom (Member) commented Mar 13, 2024

My best guess without having done any debugging yet is that the CPU throttling is affecting the background certificate refresh that this Go Connector runs.

FWIW we're working on a lazy refresh option so that there are no goroutines running in the background. In other words, when a client (or pool) requests a connection, the Go Connector retrieves the certificate right then and there. This would presumably help the Cloud Run use case (where in reality background goroutines do not run reliably).

You might try CPU always allocated, but for a low traffic service, the cost probably isn't worth the change.
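For illustration, opting in to lazy refresh might look roughly like the sketch below once the feature lands; the WithLazyRefresh option name is an assumption here, not a released API.

// Hypothetical sketch: a dialer that refreshes certificates lazily on
// Dial instead of in a background goroutine. WithLazyRefresh is an
// assumed option name; the shipped API may differ.
d, err := alloydbconn.NewDialer(
	ctx,
	alloydbconn.WithIAMAuthN(),
	alloydbconn.WithLazyRefresh(),
)
if err != nil {
	log.Fatalf("failed to initialize dialer: %v", err)
}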

mvdan (Author) commented Mar 13, 2024

That might actually make sense. The errors only happen in the middle of the night as our daily e2e test run happens, likely hours after any other traffic. My repeated attempts at reproducing the flake have failed, but I always did that in the middle of the day, not really letting the service stay idle for long.

Let me know when you have a commit for us to try, and we'll report back. We've only seen two failures in three weeks, though, so we might realistically need a whole month before we can convince ourselves that the error is gone for good.

enocom (Member) commented Mar 18, 2024

Sounds good. We're going to be working on a lazy refresh across the entire suite of connectors, so it might be a while before we have something to try. I'll update here when we have a commit ready.

enocom (Member) commented Mar 18, 2024

cc @ttosta-google

mvdan (Author) commented Mar 22, 2024

Note that we switched our Cloud Run service to "CPU is always allocated" and we are still seeing this error; in fact we see it more often now, multiple times a day, as our traffic is starting to ramp up 😬 It even happened just now, right after we had deployed a new revision of the service, meaning that the process had only been alive for a few minutes.

Has there been any decent testing of alloydb-go-connector used with pgxpool on top, as I describe in the snippet above, following the README? I'm starting to worry that the problem is the pooling of connections, or the interaction of pgx/pgxpool with alloydb-go-connector. I don't think CPU allocation is related at all, given the paragraph above.

enocom (Member) commented Mar 25, 2024

Thanks, @mvdan. We do run our test suite using pgxpool (see here) and haven't seen any regular test failures with context cancelled. Nonetheless, I would expect "CPU is always allocated" to make this problem go away.

Let me take a closer look and get back to you.

mvdan (Author) commented Mar 25, 2024

Hmm, I owe you an apology - this was all a pretty significant blunder on my part :) Note defer d.Close(), but a lack of defer pool.Close(). The database setup code was all in an "open database" method, and then the SQL statements with pgxpool were run from goroutines handling HTTP requests, long after the "open database" method returned.

So, effectively, we were using an open pgxpool.Pool with a closed alloydb.Dialer. The errors appeared to happen when pgxpool decided a connection was idle (after 30m), and when trying to open a new connection with the dialer, it would fail with the confusing error (not surprisingly).

I lowered the pgxpool config lifetimes, including the idle time for connections, and the error happened almost instantly, which is how I narrowed it down. After the fix (not closing the dialer too early), everything seems to work normally.

This was 100% a bug in our code, but could I suggest that you make Dialer.Dial fail with a very clear error if Dialer.Close was already called? Even an error like "dialer was already closed" would have saved us multiple hours of debugging, as the "context canceled" error is pretty generic in Go and made me think of all sorts of potential causes.
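For anyone hitting the same thing, here's a minimal sketch of one way to structure this so the dialer outlives the pool; the function shape and names (openDB, cleanup) are illustrative, not our exact code, and it assumes the same imports as the snippet above (context, net, alloydbconn, pgxpool).

// Sketch: return the pool plus a cleanup func instead of deferring
// d.Close() inside the open function, so the dialer stays alive for
// the lifetime of the pool.
func openDB(ctx context.Context, dbURL, instanceURI string) (*pgxpool.Pool, func(), error) {
	config, err := pgxpool.ParseConfig(dbURL)
	if err != nil {
		return nil, nil, err
	}
	d, err := alloydbconn.NewDialer(ctx, alloydbconn.WithIAMAuthN())
	if err != nil {
		return nil, nil, err
	}
	config.ConnConfig.DialFunc = func(ctx context.Context, _, _ string) (net.Conn, error) {
		return d.Dial(ctx, instanceURI)
	}
	pool, err := pgxpool.NewWithConfig(ctx, config)
	if err != nil {
		d.Close()
		return nil, nil, err
	}
	cleanup := func() {
		// Close the pool first so no new dials happen, then the dialer.
		pool.Close()
		d.Close()
	}
	return pool, cleanup, nil
}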

enocom (Member) commented Mar 25, 2024

I'm so glad you found this. I was feeling very nervous about a lurking bug otherwise.

I think we can do two things to close this issue:

  1. Return a better error as you suggest.
  2. Update the sample to make it copy-paste friendly. Right now, a person could reasonably do exactly the same thing you did based on our sample. We can do better here.

Thank you for all the info here. I'm still curious if you'll see occasional errors when CPU is not always allocated. So if you do, feel free to report those. We have lazy refresh planned for the next quarter and will be addressing it across all our connectors (Cloud SQL, AlloyDB across Java, Python, Go, and Node.js).
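To make item 1 concrete, a dialer can track whether Close has been called and return a descriptive error instead of surfacing the canceled internal context; the snippet below is a generic sketch of that pattern, not the connector's actual implementation.

// Generic sketch (not the connector's real code): refuse to dial once
// the dialer has been closed. Assumes imports: context, errors, net, sync.
var errDialerClosed = errors.New("dialer: already closed, cannot dial")

type closableDialer struct {
	mu     sync.Mutex
	closed bool
}

func (d *closableDialer) Dial(ctx context.Context, instance string) (net.Conn, error) {
	d.mu.Lock()
	closed := d.closed
	d.mu.Unlock()
	if closed {
		return nil, errDialerClosed
	}
	// ... perform the real dial here ...
	return nil, errors.New("not implemented in this sketch")
}

func (d *closableDialer) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.closed = true
	return nil
}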

mvdan (Author) commented Mar 25, 2024

Those two steps seem reasonable to me; happy to code review those patches as well, given my recent experience. You're right that I copy-pasted from your example, and I completely missed one of the defers. The fact that the code seemed to work during initial testing (and in fact it worked for up to half an hour given the defaults) made me think it was right.

We had switched away from only allocating CPU to handle incoming requests to see if that would help with the "idle" errors, but since it didn't, and the bug was this line of code instead, I'll move us back to on-demand CPU again. I assume we won't see any more errors :)

enocom (Member) commented Mar 25, 2024

Sounds good. I have a few high priority items to get to first and will get a PR up for this soon.

@enocom enocom added type: cleanup An internal cleanup or hygiene concern. type: docs Improvement to the documentation for an API. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Apr 1, 2024
@enocom enocom assigned enocom and unassigned jackwotherspoon Apr 1, 2024
enocom added commits referencing this issue on Apr 3–9, 2024:

If the dialer has already been closed, return a clear error.

Fixes #522