Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e TestGitPipelineRun test fails on newly installed pipeline #3627

Closed
barthy1 opened this issue Dec 10, 2020 · 22 comments
Closed

e2e TestGitPipelineRun test fails on newly installed pipeline #3627

barthy1 opened this issue Dec 10, 2020 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@barthy1
Copy link
Member

barthy1 commented Dec 10, 2020

Expected Behavior

TestGitPipelineRun test should pass

Actual Behavior

TestGitPipelineRun test fails because of timeout. Corresponding taskrun is created but no pod is started.
Usually second attempt is successful.

# oc get tr -A
NAMESPACE         NAME                                     SUCCEEDED   REASON   STARTTIME   COMPLETIONTIME
arendelle-8nxtw   git-check-pipeline-run-git-check-72mth

# oc get pod -n arendelle-8nxtw
No resources found in arendelle-8nxtw namespace.

Steps to Reproduce the Problem

  1. Install k8s cluster using kind
  2. Install nightly pipeline build
kubectl apply --filename https://storage.googleapis.com/tekton-releases-nightly/pipeline/latest/release.yaml
  1. Run the test just after pipeline is installed
# go test -v -tags=e2e -count=1 ./test -run ^TestGitPipelineRun/tekton_@_master_with_PR_refspec
  1. Wait 10mins and get the error
# go test -v -tags=e2e -count=1 ./test -run ^TestGitPipelineRun/tekton_@_master_with_PR_refspec
Using kubeconfig at `/root/.kube/config` with cluster ``
Cluster information
- Version: v1.19.1
- BuildDate: 2020-09-14T07:30:52Z
- GoVersion: go1.15
- Platform: linux/amd64
=== RUN   TestGitPipelineRun
=== RUN   TestGitPipelineRun/tekton_@_master_with_PR_refspec
=== PAUSE TestGitPipelineRun/tekton_@_master_with_PR_refspec
=== CONT  TestGitPipelineRun/tekton_@_master_with_PR_refspec
    init_test.go:147: Create namespace arendelle-8nxtw to deploy to
    init_test.go:163: Verify SA "default" is created in namespace "arendelle-8nxtw"
    git_checkout_test.go:103: Creating Git PipelineResource git-source-resource
    git_checkout_test.go:119: Creating PipelineRun git-check-pipeline-run
panic: test timed out after 10m0s
...
FAIL	github.com/tektoncd/pipeline/test	600.088s
FAIL

Additional Info

  • Kubernetes version:
# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-14T07:30:52Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
# tkn version
Client version: 0.11.0
Pipeline version: v20201210-d58a1ae68f
Triggers version: unknown

/kind bug

@barthy1 barthy1 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2020
@nikhil-thomas
Copy link
Member

does kubectl describe taskrun git-check-pipeline-run-git-check-72mth -n arendelle-8nxtw give any additional information.

@chmouel
Copy link
Member

chmouel commented Dec 14, 2020

You probably would want to set the TEST_KEEP_NAMESPACES environment variable to be able to inspect the logs of the failed container :

pipeline/test/init_test.go

Lines 111 to 116 in d702ca1

if os.Getenv("TEST_KEEP_NAMESPACES") == "" && !t.Failed() {
t.Logf("Deleting namespace %s", namespace)
if err := cs.KubeClient.Kube.CoreV1().Namespaces().Delete(ctx, namespace, metav1.DeleteOptions{}); err != nil {
t.Errorf("Failed to delete namespace %s: %s", namespace, err)
}
}

@barthy1
Copy link
Member Author

barthy1 commented Dec 14, 2020

@nikhil-thomas don't have the taskrun mentioned in this issue any more. Providing describe for new taskrun. From my point of view no additional info there.

# oc describe taskrun git-check-pipeline-run-git-check-7m7kw -n arendelle-lf2x8
Name:         git-check-pipeline-run-git-check-7m7kw
Namespace:    arendelle-lf2x8
Labels:       app.kubernetes.io/managed-by=tekton-pipelines
              tekton.dev/pipeline=git-check-pipeline-run
              tekton.dev/pipelineRun=git-check-pipeline-run
              tekton.dev/pipelineTask=git-check
Annotations:  <none>
API Version:  tekton.dev/v1beta1
Kind:         TaskRun
Metadata:
  Creation Timestamp:  2020-12-14T09:40:42Z
  Generation:          1
  Managed Fields:
    API Version:  tekton.dev/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:tekton.dev/pipeline:
          f:tekton.dev/pipelineRun:
          f:tekton.dev/pipelineTask:
        f:ownerReferences:
          .:
          k:{"uid":"5a46cb1b-7c3e-4c58-9c0a-68fe9bd4942f"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:resources:
          .:
          f:inputs:
        f:serviceAccountName:
        f:taskSpec:
          .:
          f:resources:
            .:
            f:inputs:
          f:steps:
        f:timeout:
      f:status:
        .:
        f:podName:
    Manager:    controller
    Operation:  Update
    Time:       2020-12-14T09:40:42Z
  Owner References:
    API Version:           tekton.dev/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  PipelineRun
    Name:                  git-check-pipeline-run
    UID:                   5a46cb1b-7c3e-4c58-9c0a-68fe9bd4942f
  Resource Version:        2042242
  Self Link:               /apis/tekton.dev/v1beta1/namespaces/arendelle-lf2x8/taskruns/git-check-pipeline-run-git-check-7m7kw
  UID:                     99d06efb-23d4-4730-8958-0c7692d81135
Spec:
  Resources:
    Inputs:
      Name:  gitsource
      Resource Ref:
        Name:            git-source-resource
  Service Account Name:  default
  Task Spec:
    Resources:
      Inputs:
        Name:  gitsource
        Type:  git
    Steps:
      Args:
        --git-dir=/workspace/gitsource/.git
        show
      Image:  alpine/git
      Name:
      Resources:
  Timeout:  1h0m0s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Started  12m   TaskRun

@chmouel no pods are actually created, so cannot inspect the logs.

@barthy1
Copy link
Member Author

barthy1 commented Dec 14, 2020

Update from my side, in 18mins after starting the test, taskrun was finished successfully and it took just several seconds. But obviously the test failed because of 10mins waiting timeout. So I observe some kind of delay for taskrun execution, when I start it first time.

@afrittoli
Copy link
Member

Update from my side, in 18mins after starting the test, taskrun was finished successfully and it took just several seconds. But obviously the test failed because of 10mins waiting timeout. So I observe some kind of delay for taskrun execution, when I start it first time.

Do you wait for all pods to be ready in the tekton-pipelines namespace before starting to run the tests?

@afrittoli
Copy link
Member

Update from my side, in 18mins after starting the test, taskrun was finished successfully and it took just several seconds. But obviously the test failed because of 10mins waiting timeout. So I observe some kind of delay for taskrun execution, when I start it first time.

Do you wait for all pods to be ready in the tekton-pipelines namespace before starting to run the tests?

Looking at the task that installs Tekton it looks like you do, I wonder if there is something else causing the delay... it would be useful if you could share the controller logs?

@barthy1
Copy link
Member Author

barthy1 commented Feb 17, 2021

Do you wait for all pods to be ready in the tekton-pipelines namespace before starting to run the tests?

yes

Looking at the task that installs Tekton it looks like you do, I wonder if there is something else causing the delay... it would be useful if you could share the controller logs?

sure, I will provide the logs when do clean setup

@barthy1
Copy link
Member Author

barthy1 commented Feb 17, 2021

the logs from controller from start till test started to execute
https://gist.github.com/barthy1/58e7fe5bc5dd67335c0c24a93894ae83#file-tekton_controller_logs

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2021
@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2021
@ghost
Copy link

ghost commented Jun 29, 2021

@barthy1 does this remain an issue in more recent e2e test runs?

@barthy1
Copy link
Member Author

barthy1 commented Jul 2, 2021

@sbwsg

does this remain an issue in more recent e2e test runs?

Yes, I see the same behaviour with the latest main code

@ghost
Copy link

ghost commented Jul 2, 2021

There's a very sudden jump in the logs from 17:20 to 17:37 where nothing appears to be happening, and then an error is reported related to aws credentials:

{"level":"info","ts":"2021-02-17T17:20:12.695Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"736e98e","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"eb0c2a9b-396c-481d-b7e5-1207486aea80","knative.dev/key":"arendelle-2rtch/task-run-pipeline-run-cancel-retries-1-pdnjuwjf","duration":0.000033238}
E0217 17:37:55.174255       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors

That error comes from k8schain (more accurately from the k8s-pkg-credentialprovider package) and so I believe that this process is stuck waiting for the k8schain package to do ... something (maybe timeout initializing aws creds? but where did the 17 minute delay come from?). I'm just not sure what is happening yet. I've been seeing a similar problem locally in Kind working on a separate issue. For me locally the delay is exactly 10 minutes.

Pulling in some more folks here who I mentioned the delay to yesterday on Slack. fyi @vdemeester and @imjasonh .

@vdemeester
Copy link
Member

This sounds heavily similar to the GCP error we have with OpenShift (where the URL that k8s-pkg-credentialprovider does query a magic url and waits forever for an answer, completely locking the controller).
The fork is exists only to remove the dependency on k8s.io/kubernetes (which brings way too much to the dependency graph) and the usage of it is in google/go-containerregistry.

Something might have changed in the recent version of this package and/or on the service side of AWS 🤔

@ghost
Copy link

ghost commented Jul 2, 2021

Using the runtime.trace package I'm able to isolate 8x 75s-long calls to http.dialConnFor, that look like they're all spawned from the aws-sdk-go package. So at least in my case I think this accounts for the ten minute delay:

0.000342165	 .         	198	task k8schain (id 1, parent 0) created
0.000345207	 .     3042	198	namespace=default
0.000346755	 .     1548	198	saName=default
0.000348249	 .     1494	198	region k8schain.New started (duration: 10m2.20341841s)
0.005701440	 .  5353191	198	new goroutine 260: github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).doCall
0.005805444	 .   104004	260	new goroutine 261: net/http.(*Transport).dialConnFor
75.329407376	75.323601932	260	new goroutine 464: net/http.(*Transport).dialConnFor
150.583420726	75.254013350	260	new goroutine 785: net/http.(*Transport).dialConnFor
225.953546323	75.370125597	260	new goroutine 991: net/http.(*Transport).dialConnFor
301.129792264	75.176245941	260	new goroutine 1258: net/http.(*Transport).dialConnFor
376.398534187	75.268741923	260	new goroutine 1497: net/http.(*Transport).dialConnFor
451.715118090	75.316583903	260	new goroutine 1809: net/http.(*Transport).dialConnFor
527.040350774	75.325232684	260	new goroutine 2069: net/http.(*Transport).dialConnFor
602.203773864	75.163423090	198	total-k8schain-time=10m2.203418063s
602.203775323	 .     1459	198	task end
GC:22.952305ms

@ghost
Copy link

ghost commented Jul 2, 2021

Searching around for similar issues it does sound like the aws sdk introduced a security feature that has caused lots of difficult-to-debug delays for other projects.

Here's one example with a good explanation of what's going on: zalando-incubator/kube-ingress-aws-controller#309 (comment)

Edit: The problem I'm facing now is that all of this AWS stuff is happening way down deep in our dependency tree. Not sure yet how to mitigate or work around.

@ghost
Copy link

ghost commented Jul 2, 2021

One workaround that appears to speed things up is to set nonsense AWS credentials in environment variables. I've updated the pipelines controller deployment with the following change:

λ git diff
diff --git a/config/controller.yaml b/config/controller.yaml
index d07084c5d..728776b4c 100644
--- a/config/controller.yaml
+++ b/config/controller.yaml
@@ -89,6 +89,12 @@ spec:
         - name: config-registry-cert
           mountPath: /etc/config-registry-cert
         env:
+        - name: AWS_ACCESS_KEY_ID
+          value: foobarbaz
+        - name: AWS_SECRET_ACCESS_KEY
+          value: foobarbaz
+        - name: AWS_DEFAULT_REGION
+          value: the-event-horizon

Now my first call to k8schain.New() seem to be completing much quicker.

@barthy1 could you try doing something similar with your pipelines deployment and see if that changes the slowness?

@barthy1
Copy link
Member Author

barthy1 commented Jul 3, 2021

@sbwsg after I added the suggested env variables, I do not observe the slowness mentioned in this issue any more 👍

@ghost
Copy link

ghost commented Jul 7, 2021

Awesome, thanks for the update. This is very concerning. I wonder how many projects are embedding k8schain or otherwise initializing the aws sdk and therefore slowing everything down just because it's a transitive dependency.

@ghost
Copy link

ghost commented Jul 9, 2021

/remove-lifecycle rotten
/lifecycle frozen

@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 9, 2021
tekton-robot pushed a commit that referenced this issue Jul 12, 2021
Several issues have now reared their head which are directly caused
by an update to the aws-sdk. The update results in extremely long
delays in the execution of tasks after the Pipelines controller is
first deployed in a cluster. The aws-sdk is initialized through
a transitive dependency that pipelines pulls in via k8schain.

Here are the recent issues directly related to this aws-sdk bug:

- #3627 (Since December!)
- #4084

One quick way to work around this problem is to set fake AWS
credentials in the environment of the deployed controller. This
apparently causes the aws-sdk to skip whatever process it has
introduced that causes massive delays in execution. So this commit
does exactly that - set fake aws creds in the deployments env vars.

This is an unfortunate hack to try and mitigate the problem until
a better solution presents itself.
@bobcatfish
Copy link
Collaborator

Does #4073 fix this?

@barthy1
Copy link
Member Author

barthy1 commented Jul 16, 2021

@bobcatfish yes, with the latest pipeline code I do not observe the problems mentioned in this issue.

@barthy1 barthy1 closed this as completed Jul 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests

7 participants