Long delays on first taskrun caused by aws-sdk-go #4087

Closed
ghost opened this issue Jul 12, 2021 · 14 comments · Fixed by #4504
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ghost

ghost commented Jul 12, 2021

This is a tracking issue for a problem we're experiencing with a transitive dependency of Tekton Pipelines. aws-sdk-go appears to cause very long delays (~10 minutes) starting the first TaskRun of a freshly deployed Pipelines Controller under certain conditions.

Identifying the Problem

The issue manifests for users as a very long delay during the startup of a TaskRun. The status of the TaskRun remains completely unpopulated for a long time and then, with the default timeout of 10 minutes, the run eventually fails. See Issue 3627 and Issue 4084 for two examples of this problem showing up in user-facing contexts.

This problem has also shown up during development in my kind cluster when attempting to execute a TaskRun with a Task from a Tekton Bundle. A combination of debug logging and the Go runtime/trace package helped isolate the issue to a single call to k8schain.New. The trace snippet below shows 8 sequential network connection attempts of roughly 75s each. The connections appear to originate in the aws-sdk-go package, and all occurred during a single TaskRun execution that used a Task from a Tekton Bundle:

0.000348249	 .     1494	198	region k8schain.New started (duration: 10m2.20341841s)
0.005701440	 .  5353191	198	new goroutine 260: github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).doCall
0.005805444	 .   104004	260	new goroutine 261: net/http.(*Transport).dialConnFor
75.329407376	75.323601932	260	new goroutine 464: net/http.(*Transport).dialConnFor
150.583420726	75.254013350	260	new goroutine 785: net/http.(*Transport).dialConnFor
225.953546323	75.370125597	260	new goroutine 991: net/http.(*Transport).dialConnFor
301.129792264	75.176245941	260	new goroutine 1258: net/http.(*Transport).dialConnFor
376.398534187	75.268741923	260	new goroutine 1497: net/http.(*Transport).dialConnFor
451.715118090	75.316583903	260	new goroutine 1809: net/http.(*Transport).dialConnFor
527.040350774	75.325232684	260	new goroutine 2069: net/http.(*Transport).dialConnFor
602.203773864	75.163423090	198	total-k8schain-time=10m2.203418063s
GC:22.952305ms
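
For anyone who wants to reproduce a measurement like this, here is a minimal sketch of the approach, not the actual Pipelines instrumentation: it wraps a single k8schain.New call in a trace region so the duration shows up in go tool trace output like the snippet above. The in-cluster client setup, output file name, and empty Options are just illustrative assumptions.

package main

import (
	"context"
	"os"
	"runtime/trace"

	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// traceKeychain times a single k8schain.New call and records it as a trace
// region, so it can be inspected with `go tool trace trace.out`.
func traceKeychain(ctx context.Context) error {
	f, err := os.Create("trace.out")
	if err != nil {
		return err
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		return err
	}
	defer trace.Stop()

	// Assumes the code runs inside a cluster, like the Pipelines controller does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	region := trace.StartRegion(ctx, "k8schain.New")
	_, err = k8schain.New(ctx, client, k8schain.Options{})
	region.End()
	return err
}

func main() {
	if err := traceKeychain(context.Background()); err != nil {
		panic(err)
	}
}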

Apparent Root Cause

Searching around for supporting evidence turns up the following issue comment (along with many similar reports across different projects): zalando-incubator/kube-ingress-aws-controller#309 (comment)

Most of my own understanding of what's going on is from that issue comment, as well as an AWS blog post (see the section "Protecting against open layer 3 firewalls and NATs") and the description of IP TTL at https://en.wikipedia.org/wiki/Time_to_live#IP_packets.

In summary:

The AWS Go SDK introduced support for a new metadata service. One of its security features is that requests sent to the metadata service have their IP TTL set to 1. That TTL works fine from AWS machines but not from anywhere else: with a TTL that low, the networking machinery drops the packet after a single hop. Tekton Pipelines indirectly attempts to initialize the AWS SDK through its dependency on go-containerregistry / k8schain / k8s-pkg-credentialprovider. This happens, for example, when a bundle is fetched or an image's entrypoint is looked up from a registry. So Pipelines issues a request via the AWS SDK, the packets are almost immediately dropped by the network, and I assume the SDK then re-attempts the request several times, resulting in the eight 75-second connection attempts in my trace above.

What remains slightly confusing from my perspective: why doesn't this result in an error returned to Tekton Pipelines when the packets are dropped? According to the blog post and Wikipedia article linked above, we should expect to receive an ICMP error datagram when the TTL hits 0 without reaching its destination. But it doesn't seem like we do, or if we do, it's ignored by one of our dependencies. Looking again at the trace above, there is one initial request that seems to fail very quickly, but then 8 subsequent connection attempts that each take a long time to close.
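
One way to observe this directly, outside of Pipelines: from a machine that isn't on EC2, a hand-rolled request to the IMDSv2 token endpoint tends to hang until a timeout fires rather than failing fast. A rough repro sketch (the endpoint and header are the standard IMDSv2 ones; the 5 second cap is arbitrary):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Cap the attempt so the demo doesn't sit for the OS-level TCP timeout.
	client := &http.Client{Timeout: 5 * time.Second}

	req, err := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")

	start := time.Now()
	resp, err := client.Do(req)
	if err == nil {
		resp.Body.Close()
	}
	fmt.Printf("IMDS probe took %s, err=%v\n", time.Since(start), err)
}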

Short-Term Mitigation

In the short term we can work around this problem by supplying false AWS credentials in the Pipelines Controller deployment. This seems to short-circuit the connections that the AWS SDK is making on Tekton Pipelines' behalf. Confirmed here and here.
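
For what it's worth, my understanding of why phony credentials help: in aws-sdk-go v1 the default credential chain consults the environment provider before the EC2 instance-metadata provider, so any AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY value (even a bogus one) resolves locally and the SDK never needs to dial the metadata endpoint. A small sketch of that first link in the chain (illustrative, not Pipelines code):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/credentials"
)

func main() {
	// Reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY straight from the
	// environment; if they are set, the default chain is satisfied here and
	// never falls through to the instance-metadata provider.
	creds := credentials.NewEnvCredentials()
	val, err := creds.Get()
	fmt.Println(val.ProviderName, err)
}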

Solving the Problem

In the long term it seems like options are:

  1. Hope that AWS folks fix this process. @imjasonh has opened a PR against k8s-pkg-credentialprovider to update the aws-sdk-go dependency to see if that resolves this particular problem.
  2. Maybe introduce a way to skip initialization of the AWS SDK in k8s-pkg-credentialprovider? A new env var or similar that the controller deployment can include?
  3. Build Tekton with support for the aws provider disabled.
@vdemeester
Member

There are also ways to build Tekton without this feature built in, similar to what we do with Google Cloud in OpenShift Pipelines. There is a build tag to disable those credential helpers in the go-containerregistry library.

@sbwsg does this happen only for OCI bundles, or also on tasks that do not specify a command? (A task with both command and args shouldn't run into this bug, I feel.)

@ghost
Author

ghost commented Jul 12, 2021

Nice, I will add this as another option, thanks @vdemeester !

It also happens on tasks that don't specify a command - #4084 is an example of this. Looks like it doesn't affect tasks which don't hit the credential helpers, like those with a command specified.

@bobcatfish bobcatfish added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jul 15, 2021
@vdemeester
Member

Nice, I will add this as another option, thanks @vdemeester !

It also happens on tasks that don't specify a command - #4084 is an example of this. Looks like it doesn't affect tasks which don't hit the credential helpers, like those with a command specified.

right so this is definitely very similar to the issue we are hitting with openshift and gcp 😅

@ghost
Author

ghost commented Jul 16, 2021

right so this is definitely very similar to the issue we are hitting with openshift and gcp 😅

How do you manage it in openshift? build your own version of the credential provider without gcp support?

@imjasonh
Member

imjasonh commented Jul 16, 2021

How do you manage it in openshift? build your own version of the credential provider without gcp support?

Yeah, OpenShift Pipelines builds with go build -tags=disable_gcp ./..., which causes this file to be excluded from compilation.

You can do the same with -tags=disable_aws to disable this file.
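
For context on how the tag works: the provider files are guarded by build constraints, so compiling with the tag set simply drops them (and their SDK imports) from the binary. The shape is roughly this; the file and package names below are illustrative, not the real go-containerregistry / credentialprovider source:

//go:build !disable_aws
// +build !disable_aws

// Hypothetical aws_credentials.go: when built with -tags=disable_aws the
// constraint above is false, this file is excluded from compilation, and the
// AWS credential helper (and aws-sdk-go with it) never ends up in the binary.
package credentialprovider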

@ghost
Author

ghost commented Jul 16, 2021

I wonder why the gcp timeout issue doesn't happen in my kind cluster (not running on gcp or google hardware) the same way the aws one does.

@vdemeester
Member

I wonder why the gcp timeout issue doesn't happen in my kind cluster (not running on gcp or google hardware) the same way the aws one does.

Because of the way OpenShift disallows the network operation. On kind it's probably getting a 404 or something really quickly, whereas OpenShift seems to let the network hang (not sure about the details..)

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2021
@bobcatfish
Collaborator

We definitely want to avoid long delays on a first taskrun, especially if they are caused by a dependency that might not even be in use. I think ultimately that TEP-0060 (remote resource resolution) would be a way of addressing this (or maybe it would just move the problem?)

It seems to me like this issue should block #3661 - on the other hand, maybe the short-term workaround @sbwsg added is enough to consider this closed?

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2021
@ghost
Author

ghost commented Oct 26, 2021

I don't think TEP-0060 is going to provide us any cover here, unfortunately. When a command is omitted from a step, Pipelines is required to look up the entrypoint and hits the slowdown, unrelated to resolution or bundles.

@vdemeester I just tried pulling in vdemeester/k8s-pkg-credentialprovider with the updated aws sdk @imjasonh committed back in July but saw the following error:

 (main) λ go get github.com/vdemeester/k8s-pkg-credentialprovider@4ed9eb75b43aa88d09d416da7512442299996845
go: downloading github.com/vdemeester/k8s-pkg-credentialprovider v1.21.1-0.20210712215949-4ed9eb75b43a
go get: github.com/vdemeester/k8s-pkg-credentialprovider@v1.21.1-0.20210712215949-4ed9eb75b43a requires
	k8s.io/kubelet@v0.0.0: reading k8s.io/kubelet/go.mod at revision v0.0.0: unknown revision v0.0.0

Is that something you've seen before with k8s-pkg-credentialprovider? Any way to workaround it you know of?

Edit:

A bit more investigating: this import error is particularly weird because the k8s-pkg-credentialprovider fork includes explicit replace directives for all of the Kubernetes dependencies.

So I tried bringing those replace directives over but am not able to get builds to succeed right now.
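
For context, the Kubernetes staging modules (k8s.io/kubelet and friends) are pinned at v0.0.0 inside k8s.io/kubernetes, which is why consumers typically need a replace block roughly like the following in their own go.mod. The versions here are illustrative, not the exact set the fork pins:

// go.mod (illustrative versions)
replace (
	k8s.io/api => k8s.io/api v0.21.1
	k8s.io/apimachinery => k8s.io/apimachinery v0.21.1
	k8s.io/client-go => k8s.io/client-go v0.21.1
	k8s.io/kubelet => k8s.io/kubelet v0.21.1
)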

@vdemeester
Member

Oh.. we may need to fix the master branch though.. 😓 There shouldn't be any replace directives there.. nor k8s.io/kubernetes v1.21.0 dependency…
See on 1.21

I need to check on something.. 🙃

@ghost
Author

ghost commented Jan 6, 2022

It looks as though go.mod in k8s-pkg-credentialprovider 1.22 has dropped the replace directives. Maybe it's ready to bring into Tekton Pipelines - @vdemeester wdyt?

I've just tried updating to the 1.22 commit hash in our go.mod but couldn't get compilation working - it seems like maybe some of the ./hack generation scripts collide with versions pulled in here. I'll try again tomorrow.

@ghost
Author

ghost commented Jan 6, 2022

/assign

@tekton-robot tekton-robot assigned ghost Jan 6, 2022
@imjasonh
Member

imjasonh commented Jan 6, 2022

Possibly of interest, we're undergoing a bit of a refactor in ggcr around how various cloud auth providers are made available -- see google/go-containerregistry#1231 (and google/go-containerregistry#1234)

Basically, we all hate k8schain and its massive k8s client-go and Vincent's-fork-of-credproviders dependencies, and we'd like to come up with something better.

In doing so we might accidentally fix the aws-sdk-go startup slowness 🤞

This was referenced Jan 18, 2022
tekton-robot pushed a commit that referenced this issue Jan 21, 2022
I tested the changes @imjasonh made in adf030 without these fake aws
credentials and it does appear that the ten-minute slowdown we were
observing in #4087 is fixed.

This PR backs out the phony AWS creds that were in place on the
controller to work around long delays on initial image fetches for
entrypoint lookup (and bundles) related to k8schain.
khrm pushed a commit to openshift/tektoncd-pipeline that referenced this issue May 19, 2022