allow for retry on typically transient k8s errors in both core controller and resolver for remote resolution #7894

Merged
merged 1 commit into from
May 16, 2024

gabemontero
Contributor

@gabemontero gabemontero commented Apr 18, 2024

Changes

collaborating with @khrm here is an augmented version of #7893

Fixes #7909

On both sides of remote resolution (the core controller and the resolver), typically transient Kubernetes errors were being treated as permanent knative errors. No further reconcile attempts were made, leading to avoidable failures.

These changes address that.

An example log snippet from the core controller

Pipeline rh-acs-tenant/operator-on-pull-request-bwqxj can't be Run; it contains Tasks that don't exist: Couldn't retrieve Task "": retryable error validating referenced object source-build: Internal error occurred: failed calling webhook "validation.webhook.pipeline.tekton.dev": failed to call webhook: Post "https://tekton-pipelines-webhook.openshift-pipelines.svc:443/resource-validation?timeout=10s": context deadline exceeded

Accompanying log snippet from the resolver

{"level":"error","ts":"2024-04-17T10:50:05.866Z","logger":"controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"f0a1d64","knative.dev/traceid":"b893d6a6-2eb7-4a53-b502-1348803a7085","knative.dev/key":"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770","duration":"10.3628985s","error":"error updating resource request \"rh-acs-tenant/bundles-780a1fe396cb0f8c702b34e9289fc770\" with data: Internal error occurred: failed calling webhook \"webhook.pipeline.tekton.dev\": failed to call webhook: Post \"https://tekton-pipelines-webhook.openshift-pipelines.svc:443/defaulting?timeout=10s\": context deadline exceeded","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\t/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:491"}

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [n/a] Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • [/] Has Tests included if any functionality added or changed
  • [/] pre-commit Passed
  • [/] Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • [/] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [/] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • [n/a] Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

This fix addresses the lack of retries on transient Kubernetes errors during remote resolution of tasks, etc.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Apr 18, 2024
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 18, 2024
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/resources/taskref.go 94.0% 90.2% -3.9
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 73.9% 0.8

@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from fd6ea81 to db0ed3f Compare April 18, 2024 18:50
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/resources/taskref.go 94.0% 94.8% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 73.9% 0.8

@gabemontero
Contributor Author

just pushed an extra-credit commit that properly generates TaskNotFoundError when the task ref name is not set in the Name field, but rather comes from the name Param

like:

      taskRef:
        kind: Task
        params:
        - name: name
          value: summary
        - name: bundle
          value: myregistry/myrepo/myimage:mytag_or_sha
        - name: kind
          value: task
        resolver: bundles

I can break it off into a separate PR if desired
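
For readers following along, the lookup described in this comment can be sketched roughly as below. This is a simplified illustration, not the PR's code: `Param`, `TaskRef`, and `taskName` are hypothetical stand-ins for the real Tekton types, with param values reduced to plain strings.

```go
package main

import "fmt"

// Param and TaskRef are simplified, hypothetical stand-ins for the
// Tekton API types; only the fields needed for the example are kept.
type Param struct {
	Name  string
	Value string
}

type TaskRef struct {
	Name     string
	Resolver string
	Params   []Param
}

// taskName prefers the Name field and falls back to a param called "name",
// so an error message can still identify the task for resolver-based refs.
func taskName(ref TaskRef) string {
	if ref.Name != "" {
		return ref.Name
	}
	for _, p := range ref.Params {
		if p.Name == "name" {
			return p.Value
		}
	}
	return ""
}

func main() {
	ref := TaskRef{
		Resolver: "bundles",
		Params: []Param{
			{Name: "name", Value: "summary"},
			{Name: "bundle", Value: "myregistry/myrepo/myimage:mytag_or_sha"},
			{Name: "kind", Value: "task"},
		},
	}
	fmt.Printf("Couldn't retrieve Task %q\n", taskName(ref)) // Couldn't retrieve Task "summary"
}
```

Without the fallback, the message would read `Couldn't retrieve Task ""`, as in the log snippet in the PR description.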

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.8% 0.1
pkg/reconciler/taskrun/resources/taskref.go 94.0% 94.8% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 73.9% 0.8

@savitaashture
Contributor

LGTM

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from 10a9dbe to e4973c0 Compare April 19, 2024 20:09
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 94.0% 94.8% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 71.8% -1.3
pkg/resolution/resource/name.go 76.2% 45.7% -30.5

@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from e4973c0 to 0e94804 Compare April 22, 2024 12:41
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 94.0% 94.8% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 71.8% -1.3
pkg/resolution/resource/name.go 76.2% 45.7% -30.5

@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from 0e94804 to 4368c8a Compare April 22, 2024 13:01
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 94.0% 94.8% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 71.8% -1.3
pkg/resolution/resource/name.go 76.2% 45.7% -30.5

@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from 4368c8a to 84a6c2f Compare April 24, 2024 19:06
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 94.0% 94.8% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 71.8% -1.3
pkg/resolution/resource/name.go 76.2% 45.7% -30.5

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2024
@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from fab6011 to e96f4a0 Compare May 14, 2024 21:09
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2024
@gabemontero
Contributor Author

ok @chitrangpatel @afrittoli I have rebased on top of @chitrangpatel 's recently merged refactor

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 94.3% 95.0% 0.8
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resolver/framework/reconciler.go 73.1% 71.8% -1.3
pkg/resolution/resource/name.go 55.3% 65.4% 10.1

Contributor

@chitrangpatel chitrangpatel left a comment

Can you also make the corresponding changes from pkg/resolution/resolver/framework/reconciler.go and its test

into pkg/remoteresolution/resolver/framework/reconciler.go and its test file?

All the other files are LGTM!

That's the newer framework. The current one will be deprecated at some point.

@afrittoli
Member

Can you also make the corresponding changes from pkg/resolution/resolver/framework/reconciler.go and its test

into pkg/remoteresolution/resolver/framework/reconciler.go and its test file?

All the other files are LGTM!

That's the newer framework. The current one will be deprecated at some point.

Thanks @chitrangpatel - as I mentioned on slack, I think we should only do the changes in the new framework, since that's what is actually deployed and used now, and I would not want to have to maintain both. The old framework is still around to provide some room for potential users to move to the new one, but I would not backport changes to it.

@chitrangpatel should we mark the old resolvers as deprecated already?

@chitrangpatel
Contributor

chitrangpatel commented May 15, 2024

Can you also make the corresponding changes from pkg/resolution/resolver/framework/reconciler.go and its test
into pkg/remoteresolution/resolver/framework/reconciler.go and its test file?
All the other files are LGTM!
That's the newer framework. The current one will be deprecated at some point.

Thanks @chitrangpatel - as I mentioned on slack, I think we should only do the changes in the new framework, since that's what is actually deployed and used now, and I would not want to have to maintain both. The old framework is still around to provide some room for potential users to move to the new one, but I would not backport changes to it.

@chitrangpatel should we mark the old resolvers as deprecated already?

Yes, I'm happy to. I submitted a PR for that already.

@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from e96f4a0 to 018b243 Compare May 15, 2024 14:49
@gabemontero
Contributor Author

ok @chitrangpatel @afrittoli I moved those changes from the deprecated resolution subpackage over to remoteresolution

there was a minor goimport cleanup I left in the old deprecated package reconciler_test.go file

I also squashed the commits and updated the git commit message accordingly

PTAL / thanks

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 93.1% 93.8% 0.7
pkg/remoteresolution/resolver/framework/reconciler.go 73.1% 71.8% -1.3
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resource/name.go 55.3% 65.4% 10.1

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chitrangpatel, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chitrangpatel,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2024
Member

@afrittoli afrittoli left a comment

Thanks for this PR!
I added some minor comments, but nothing blocking.
/lgtm

@@ -645,8 +646,12 @@ func resolveTask(
 	case errors.Is(err, remote.ErrRequestInProgress):
 		return rt, err
 	case err != nil:
 		name := pipelineTask.TaskRef.Name
Member

NIT: since you've done the investigation, would you mind adding a comment to explain in which situations the name might be empty?

Contributor Author

part of next push

Contributor Author

oops .... I pushed the rebase before this update @afrittoli and with @chitrangpatel lgtm the pr has merged

I'll handle this and the unit test coverage item in a follow-up PR, which I should be able to open today.

Contributor Author

see #7950

Comment on lines +1969 to +1973
errors.New("etcdserver: leader changed"),
context.DeadlineExceeded,
apierrors.NewConflict(pipeline.TaskRunResource, "", nil),
apierrors.NewServerTimeout(pipeline.TaskRunResource, "", 0),
apierrors.NewTimeoutError("", 0),
Member

Nice :)

paramString += fmt.Sprintf("name could not be marshalled: %s\n", err.Error())
continue
}
name = string(asJSON)
Member

NIT: coverage shows no coverage for this line, can we add a test case with an object param that can be marshalled as JSON? It can be a separate PR.

Contributor Author

Yep I'll handle this in separate PR ... hopefully should have it open today

Contributor Author

see #7950

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2024
…ller and resolver for remote resolution

On both sides of remote resolution (the core controller and the resolver), typically transient Kubernetes errors were being treated as permanent knative errors. No further reconcile attempts were made, leading to avoidable failures.

While diagnosing this, we also discovered that TaskNotFoundError was missing the Task name when the task is identified via params. That is addressed as well.
@gabemontero gabemontero force-pushed the err-log-remote-get-task branch from 018b243 to cf47a44 Compare May 16, 2024 13:49
@tekton-robot tekton-robot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 16, 2024
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 96.7% 96.7% 0.0
pkg/reconciler/taskrun/resources/taskref.go 93.3% 94.0% 0.7
pkg/remoteresolution/resolver/framework/reconciler.go 76.1% 74.6% -1.5
pkg/resolution/common/errors.go 17.6% 13.0% -4.6
pkg/resolution/resource/name.go 56.1% 65.5% 9.4

@chitrangpatel
Contributor

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2024
@tekton-robot tekton-robot merged commit 13f45bf into tektoncd:main May 16, 2024
12 of 13 checks passed
@gabemontero gabemontero deleted the err-log-remote-get-task branch May 16, 2024 16:36
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

tekton-pipelines-remote-resolvers controllers should retry on transient k8s errors
6 participants