Long delays on first taskrun caused by aws-sdk-go #4087
There are also ways to build Tekton without this feature built in, similarly to what we do with Google Cloud in OpenShift Pipelines. There is a build tag to be able to disable those credential helpers in the go-containerregistry library. @sbwsg does this happen only for OCI bundles, or also on tasks that do not specify a command? (A task with both command and args shouldn't run into this bug, I feel.)
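For illustration, the build-tag pattern being described looks roughly like the sketch below; the tag name, package, and registration helper are invented for the sketch rather than taken from go-containerregistry:

```go
//go:build !disable_aws
// +build !disable_aws

// Hypothetical provider-registration file: the tag name, package, and
// registration helper are made up to show the pattern, not copied from
// go-containerregistry. Building with `go build -tags disable_aws` drops this
// file, so the init below never runs and the AWS credential helper (and the
// aws-sdk-go code path behind it) stays out of the controller's startup.
package credhelpers

// registered and registerHelper stand in for whatever registration mechanism
// the real library uses; in reality they would live in a file without the
// build constraint.
var registered []string

func registerHelper(name string) {
	registered = append(registered, name)
}

func init() {
	registerHelper("aws")
}
```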
Nice, I will add this as another option, thanks @vdemeester! It also happens on tasks that don't specify a command - #4084 is an example of this. Looks like it doesn't affect tasks which don't hit the credential helpers, like those with a command specified.
Right, so this is definitely very similar to the issue we are hitting with OpenShift and GCP 😅
How do you manage it in OpenShift? Build your own version of the credential provider without GCP support?
I wonder why the GCP timeout issue doesn't happen in my Kind cluster (not running on GCP or Google hardware) the same way the AWS one does.
Because of the way OpenShift disallows the network operation. On Kind it's probably getting a 404 or something really quickly, whereas OpenShift lets the network hang, it seems (not sure about the details...).
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
We definitely want to avoid long delays on a first TaskRun, especially if they are caused by a dependency that might not even be in use. I think ultimately TEP-0060 remote resource resolution would be a way of addressing this (or maybe it would just move the problem?). It seems to me like this issue should block #3661 - on the other hand, maybe the short-term workaround @sbwsg added is enough to consider this closed? /remove-lifecycle stale
I don't think TEP-0060 is going to provide us any cover here, unfortunately. When a …

@vdemeester I just tried pulling in … Is that something you've seen before with …?

Edit: Bit more investigating. This import error is particularly weird because the … So I tried bringing those …
Oh.. we may need to fix the master branch though.. 😓 There shouldn't be any … I need to check on something.. 🙃
It looks as though go.mod in k8s-pkg-credentialprovider 1.22 has dropped the … I've just tried updating to the 1.22 commit hash in our …
/assign
Possibly of interest, we're undergoing a bit of a refactor in ggcr around how various cloud auth providers are made available -- see google/go-containerregistry#1231 (and google/go-containerregistry#1234). Basically, we all hate … In doing so we might accidentally fix the aws-sdk-go startup slowness 🤞
I tested the changes @imjasonh made in adf030 without these fake AWS credentials, and it does appear that the ten-minute slowdown we were observing in #4087 is fixed. This PR backs out the phony AWS creds that were in place on the controller to work around long delays on initial image fetches for entrypoint lookup (and bundles) related to k8schain.
This is a tracking issue for a problem we're experiencing with a transitive dependency of Tekton Pipelines. aws-sdk-go appears to cause very long delays (~10 minutes) when starting the first TaskRun of a freshly deployed Pipelines Controller under certain conditions.
Identifying the Problem
The issue manifests for users as a very long delay during the startup of a TaskRun. The status of the TaskRun will remain completely unpopulated for a very long time and then, with a default timeout of 10 minutes, it will eventually fail. See Issue 3627 and Issue 4084 for two examples of this problem showing up in user-facing contexts.
This problem has also shown up during development in my Kind cluster when attempting to execute a TaskRun with a Task from a Tekton Bundle. A combination of debug logging and the Go runtime/trace package helped isolate the issue to a single call to k8schain.New. The resulting trace showed 8 sequential network connection attempts lasting 75s each, and the connections appear to originate in the aws-sdk-go package. These occurred during a single TaskRun execution that utilized a Task from a Tekton Bundle.
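As an illustration of the isolation approach (this is not the actual debugging code from the investigation; the in-cluster setup and option values are assumptions, and the k8schain.New signature is the current go-containerregistry one):

```go
package main

import (
	"context"
	"log"
	"os"
	"runtime/trace"

	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Write an execution trace that can be inspected with `go tool trace`.
	out, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if err := trace.Start(out); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// In-cluster client setup, as the Pipelines controller would have.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Wrap the suspect call in a trace region; on a non-AWS cluster this is
	// where the long sequence of metadata-service connection attempts shows up.
	ctx := context.Background()
	region := trace.StartRegion(ctx, "k8schain.New")
	_, err = k8schain.New(ctx, client, k8schain.Options{Namespace: "default"})
	region.End()
	if err != nil {
		log.Fatal(err)
	}
}
```

Opening the output with `go tool trace trace.out` then shows how long the region took relative to the rest of startup.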
Apparent Root Cause
Searching around for supporting evidence the following issue comment (and many posts across different projects) shows up: zalando-incubator/kube-ingress-aws-controller#309 (comment)
Most of my own understanding of what's going on is from that issue comment, as well as an AWS blog post (see the section "Protecting against open layer 3 firewalls and NATs") and the description of IP TTL at https://en.wikipedia.org/wiki/Time_to_live#IP_packets.
In summary:
- The AWS Go SDK introduced support for a new metadata service. One of its security features is that the TTL of the requests sent to the metadata service is set to 1.
- That TTL works great from AWS machines but not from anywhere else. With a TTL so low, the networking machinery will drop the packet after only a single hop.
- Tekton Pipelines indirectly attempts to initialize the AWS SDK due to its dependency on go-containerregistry / k8schain / k8s-pkg-credentialprovider. This happens, for example, when a bundle is fetched or an image's entrypoint is looked up from a registry.
- So Pipelines issues whatever request to the AWS SDK and those packets are almost immediately dropped by the network. I assume it then re-attempts the requests several times, resulting in the 8 75-second-long requests in my trace above.

What remains slightly confusing from my perspective: why doesn't this result in an error returned to Tekton Pipelines when the packets are dropped? According to the blog post and Wikipedia article linked above, we should expect to receive an ICMP error datagram when the TTL hits 0 without reaching its destination. But it doesn't seem like we do, or if we do then it's ignored by one of our dependencies. Looking again at the trace above, there is one request initially that seems to fail very quickly, but then there are 8 subsequent connection attempts that each take a long time to close.
Short-Term Mitigation
In the short term we can work around this problem by supplying false AWS credentials in the Pipeline Controller deployment. This seems to short-circuit the connections that the AWS SDK is making on Tekton Pipelines' behalf. Confirmed here and here.
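A minimal sketch of why the phony credentials short-circuit things, assuming aws-sdk-go v1's default credential chain (environment credentials are consulted before the EC2 metadata service); the values below are placeholders, and in the actual workaround the variables are set on the controller Deployment rather than in code:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// Placeholder credentials, standing in for the env vars added to the
	// controller Deployment in the short-term workaround.
	os.Setenv("AWS_ACCESS_KEY_ID", "phony")
	os.Setenv("AWS_SECRET_ACCESS_KEY", "phony")

	// The default chain finds the environment credentials first and never
	// falls through to the EC2 metadata (IMDS) provider, so the TTL-1
	// requests that hang on non-AWS networks are never made.
	sess := session.Must(session.NewSession())
	creds, err := sess.Config.Credentials.Get()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("resolved credentials via:", creds.ProviderName)
}
```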
Solving the Problem
In the long term it seems like the options are:
- Update the aws-sdk-go dependency in k8s-pkg-credentialprovider to see if that resolves this particular problem.
- Provide a way to disable k8s-pkg-credentialprovider? A new env var or similar that the controller deployment can include? (See the sketch after this list.)
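For the env-var option, a rough sketch of what an opt-out could look like; the variable name and the fallback to authn.DefaultKeychain are assumptions for illustration, not an agreed design:

```go
package keychain

import (
	"context"
	"os"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
)

// New returns the keychain used for bundle fetches and entrypoint lookups.
// If the (hypothetical) env var is set on the controller deployment, the
// cloud credential helpers are skipped so the slow aws-sdk-go initialization
// path is never exercised at runtime.
func New(ctx context.Context, client kubernetes.Interface) (authn.Keychain, error) {
	if os.Getenv("DISABLE_CLOUD_CREDENTIAL_HELPERS") != "" { // name is made up
		return authn.DefaultKeychain, nil
	}
	return k8schain.New(ctx, client, k8schain.Options{})
}
```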