Long delays on first taskrun caused by aws-sdk-go #4087

Closed
ghost opened this issue Jul 12, 2021 · 14 comments · Fixed by #4504
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ghost

ghost commented Jul 12, 2021

This is a tracking issue for a problem we're experiencing with a transitive dependency of Tekton Pipelines. aws-sdk-go appears to cause very long delays (~10 minutes) starting the first TaskRun of a freshly deployed Pipelines Controller under certain conditions.

Identifying the Problem

The issue manifests for users as a very long delay during the startup of a TaskRun. The status of the TaskRun remains completely unpopulated for a long time and then, with the default timeout of 10 minutes, the run eventually fails. See Issue 3627 and Issue 4084 for two examples of this problem showing up in user-facing contexts.

This problem has also shown up during development in my kind cluster when attempting to execute a TaskRun with a Task from a Tekton Bundle. A combination of debug logging and the Go runtime/trace package helped isolate the issue to a single call to k8schain.New. The trace snippet below shows 8 sequential network connection attempts of roughly 75s each. The connections appear to originate in the aws-sdk-go package, and all occurred during a single TaskRun execution that used a Task from a Tekton Bundle:

0.000348249	 .     1494	198	region k8schain.New started (duration: 10m2.20341841s)
0.005701440	 .  5353191	198	new goroutine 260: github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).doCall
0.005805444	 .   104004	260	new goroutine 261: net/http.(*Transport).dialConnFor
75.329407376	75.323601932	260	new goroutine 464: net/http.(*Transport).dialConnFor
150.583420726	75.254013350	260	new goroutine 785: net/http.(*Transport).dialConnFor
225.953546323	75.370125597	260	new goroutine 991: net/http.(*Transport).dialConnFor
301.129792264	75.176245941	260	new goroutine 1258: net/http.(*Transport).dialConnFor
376.398534187	75.268741923	260	new goroutine 1497: net/http.(*Transport).dialConnFor
451.715118090	75.316583903	260	new goroutine 1809: net/http.(*Transport).dialConnFor
527.040350774	75.325232684	260	new goroutine 2069: net/http.(*Transport).dialConnFor
602.203773864	75.163423090	198	total-k8schain-time=10m2.203418063s
GC:22.952305ms
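
For anyone who wants to reproduce a measurement like this, here is a minimal sketch of the approach, not the actual Pipelines instrumentation: it wraps a single k8schain.New call in a trace region so the duration shows up in go tool trace output like the snippet above. The in-cluster client setup, output file name, and empty Options are just illustrative assumptions.

package main

import (
	"context"
	"os"
	"runtime/trace"

	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// traceKeychain times a single k8schain.New call and records it as a trace
// region, so it can be inspected with `go tool trace trace.out`.
func traceKeychain(ctx context.Context) error {
	f, err := os.Create("trace.out")
	if err != nil {
		return err
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		return err
	}
	defer trace.Stop()

	// Assumes the code runs inside a cluster, like the Pipelines controller does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	region := trace.StartRegion(ctx, "k8schain.New")
	_, err = k8schain.New(ctx, client, k8schain.Options{})
	region.End()
	return err
}

func main() {
	if err := traceKeychain(context.Background()); err != nil {
		panic(err)
	}
}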

Apparent Root Cause

Searching around for supporting evidence turns up the following issue comment (along with many similar reports across different projects): zalando-incubator/kube-ingress-aws-controller#309 (comment)

Most of my own understanding of what's going on is from that issue comment, as well as an AWS blog post (see the section "Protecting against open layer 3 firewalls and NATs") and the description of IP TTL at https://en.wikipedia.org/wiki/Time_to_live#IP_packets.

In summary:

The AWS Go SDK introduced support for a new metadata service. One of its security features is that requests sent to the metadata service have their IP TTL set to 1. That TTL works fine from AWS machines but not from anywhere else: with a TTL that low, the networking machinery drops the packet after a single hop. Tekton Pipelines indirectly attempts to initialize the AWS SDK through its dependency on go-containerregistry / k8schain / k8s-pkg-credentialprovider. This happens, for example, when a bundle is fetched or an image's entrypoint is looked up from a registry. So Pipelines issues a request via the AWS SDK, the packets are almost immediately dropped by the network, and I assume the SDK then re-attempts the request several times, resulting in the eight 75-second connection attempts in my trace above.

What remains slightly confusing from my perspective: why doesn't this result in an error returned to Tekton Pipelines when the packets are dropped? According to the blog post and Wikipedia article linked above, we should expect to receive an ICMP error datagram when the TTL hits 0 without reaching its destination. But it doesn't seem like we do, or if we do, it's ignored by one of our dependencies. Looking again at the trace above, there is one initial request that seems to fail very quickly, but then 8 subsequent connection attempts that each take a long time to close.
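
One way to observe this directly, outside of Pipelines: from a machine that isn't on EC2, a hand-rolled request to the IMDSv2 token endpoint tends to hang until a timeout fires rather than failing fast. A rough repro sketch (the endpoint and header are the standard IMDSv2 ones; the 5 second cap is arbitrary):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Cap the attempt so the demo doesn't sit for the OS-level TCP timeout.
	client := &http.Client{Timeout: 5 * time.Second}

	req, err := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")

	start := time.Now()
	resp, err := client.Do(req)
	if err == nil {
		resp.Body.Close()
	}
	fmt.Printf("IMDS probe took %s, err=%v\n", time.Since(start), err)
}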

Short-Term Mitigation

In the short term we can work around this problem by supplying false AWS credentials in the Pipelines Controller deployment. This seems to short-circuit the connections that the AWS SDK is making on Tekton Pipelines' behalf. Confirmed here and here.
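
For what it's worth, my understanding of why phony credentials help: in aws-sdk-go v1 the default credential chain consults the environment provider before the EC2 instance-metadata provider, so any AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY value (even a bogus one) resolves locally and the SDK never needs to dial the metadata endpoint. A small sketch of that first link in the chain (illustrative, not Pipelines code):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/credentials"
)

func main() {
	// Reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY straight from the
	// environment; if they are set, the default chain is satisfied here and
	// never falls through to the instance-metadata provider.
	creds := credentials.NewEnvCredentials()
	val, err := creds.Get()
	fmt.Println(val.ProviderName, err)
}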

Solving the Problem

In the long term it seems like options are:

  1. Hope that AWS folks fix this process. @imjasonh has opened a PR against k8s-pkg-credentialprovider to update the aws-sdk-go dependency to see if that resolves this particular problem.
  2. Maybe introduce a way to skip initialization of the AWS SDK in k8s-pkg-credentialprovider? A new env var or similar that the controller deployment can include?
  3. Build Tekton with support for the aws provider disabled.
@vdemeester
Member

There are also ways to build Tekton without this feature built in, similar to what we do with Google Cloud in OpenShift Pipelines. There is a build tag to disable those credential helpers in the go-containerregistry library.

@sbwsg does this happen only for OCI bundles, or also on tasks that do not specify a command? (A task with both command and args shouldn't run into this bug, I feel.)

@ghost
Author

ghost commented Jul 12, 2021

Nice, I will add this as another option, thanks @vdemeester !

It also happens on tasks that don't specify a command - #4084 is an example of this. Looks like it doesn't affect tasks which don't hit the credential helpers, like those with a command specified.

@bobcatfish bobcatfish added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jul 15, 2021
@vdemeester
Member

Nice, I will add this as another option, thanks @vdemeester !

It also happens on tasks that don't specify a command - #4084 is an example of this. Looks like it doesn't affect tasks which don't hit the credential helpers, like those with a command specified.

right so this is definitely very similar to the issue we are hitting with openshift and gcp 😅

@ghost
Author

ghost commented Jul 16, 2021

right so this is definitely very similar to the issue we are hitting with openshift and gcp 😅

How do you manage it in openshift? build your own version of the credential provider without gcp support?

@imjasonh
Member

imjasonh commented Jul 16, 2021

How do you manage it in openshift? build your own version of the credential provider without gcp support?

Yeah, OpenShift Pipelines builds with go build -tags=disable_gcp ./..., which causes this file to be excluded from compilation.

You can do the same with -tags=disable_aws to disable this file.
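
For context on how the tag works: the provider files are guarded by build constraints, so compiling with the tag set simply drops them (and their SDK imports) from the binary. The shape is roughly this; the file and package names below are illustrative, not the real go-containerregistry / credentialprovider source:

//go:build !disable_aws
// +build !disable_aws

// Hypothetical aws_credentials.go: when built with -tags=disable_aws the
// constraint above is false, this file is excluded from compilation, and the
// AWS credential helper (and aws-sdk-go with it) never ends up in the binary.
package credentialprovider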

@ghost
Author

ghost commented Jul 16, 2021

I wonder why the gcp timeout issue doesn't happen in my kind cluster (not running on gcp or google hardware) the same way the aws one does.

@vdemeester
Member

I wonder why the gcp timeout issue doesn't happen in my kind cluster (not running on gcp or google hardware) the same way the aws one does.

Because of the way OpenShift disallows the network operation. On kind it's probably getting a 404 or something really quickly, whereas OpenShift seems to let the network hang (not sure about the details..)

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2021
@bobcatfish
Collaborator

We definitely want to avoid long delays on a first taskrun, especially if they are caused by a dependency that might not even be in use. I think ultimately that TEP-0060 (remote resource resolution) would be a way of addressing this (or maybe it would just move the problem?)

It seems to me like this issue should block #3661 - on the other hand, maybe the short-term workaround @sbwsg added is enough to consider this closed?

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2021
@ghost
Author

ghost commented Oct 26, 2021

I don't think TEP-0060 is going to provide us any cover here, unfortunately. When a command is omitted from a step, Pipelines is required to look up the entrypoint and hits the slowdown, unrelated to resolution or bundles.

@vdemeester I just tried pulling in vdemeester/k8s-pkg-credentialprovider with the updated aws sdk @imjasonh committed back in July but saw the following error:

 (main) λ go get github.com/vdemeester/k8s-pkg-credentialprovider@4ed9eb75b43aa88d09d416da7512442299996845
go: downloading github.com/vdemeester/k8s-pkg-credentialprovider v1.21.1-0.20210712215949-4ed9eb75b43a
go get: github.com/vdemeester/k8s-pkg-credentialprovider@v1.21.1-0.20210712215949-4ed9eb75b43a requires
	k8s.io/kubelet@v0.0.0: reading k8s.io/kubelet/go.mod at revision v0.0.0: unknown revision v0.0.0

Is that something you've seen before with k8s-pkg-credentialprovider? Any way to workaround it you know of?

Edit:

A bit more investigating: this import error is particularly weird because the k8s-pkg-credentialprovider fork includes explicit replace directives for all of the Kubernetes dependencies.

So I tried bringing those replace directives over but am not able to get builds to succeed right now.
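
For context, the Kubernetes staging modules (k8s.io/kubelet and friends) are pinned at v0.0.0 inside k8s.io/kubernetes, which is why consumers typically need a replace block roughly like the following in their own go.mod. The versions here are illustrative, not the exact set the fork pins:

// go.mod (illustrative versions)
replace (
	k8s.io/api => k8s.io/api v0.21.1
	k8s.io/apimachinery => k8s.io/apimachinery v0.21.1
	k8s.io/client-go => k8s.io/client-go v0.21.1
	k8s.io/kubelet => k8s.io/kubelet v0.21.1
)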

@vdemeester
Member

Oh.. we may need to fix the master branch though.. 😓 There shouldn't be any replace directives there.. nor k8s.io/kubernetes v1.21.0 dependency…
See on 1.21

I need to check on something.. 🙃

@ghost
Author

ghost commented Jan 6, 2022

It looks as though go.mod in k8s-pkg-credentialprovider 1.22 has dropped the replace directives. Maybe it's ready to bring into Tekton Pipelines - @vdemeester wdyt?

I've just tried updating to the 1.22 commit hash in our go.mod but couldn't get compilation working - it seems like maybe some of the ./hack generation scripts collide with versions pulled in here. I'll try again tomorrow.

@ghost
Author

ghost commented Jan 6, 2022

/assign

@tekton-robot tekton-robot assigned ghost Jan 6, 2022
@imjasonh
Member

imjasonh commented Jan 6, 2022

Possibly of interest, we're undergoing a bit of a refactor in ggcr around how various cloud auth providers are made available -- see google/go-containerregistry#1231 (and google/go-containerregistry#1234)

Basically, we all hate k8schain and its massive k8s client-go and Vincent's-fork-of-credproviders dependencies, and we'd like to come up with something better.

In doing so we might accidentally fix the aws-sdk-go startup slowness 🤞

This was referenced Jan 18, 2022
tekton-robot pushed a commit that referenced this issue Jan 21, 2022
I tested the changes @imjasonh made in adf030 without these fake aws
credentials and it does appear that the ten-minute slowdown we were
observing in #4087 is fixed.

This PR backs out the phony AWS creds that were in place on the
controller to work around long delays on initial image fetches for
entrypoint lookup (and bundles) related to k8schain.
khrm pushed a commit to openshift/tektoncd-pipeline that referenced this issue May 19, 2022