[TEP-0050] Implement PipelineTask OnError #7422

Merged

Conversation

QuanZhang-William
Member

@QuanZhang-William QuanZhang-William commented Nov 27, 2023

Changes

Part of #7165. In TEP-0050, we proposed adding an OnError API field under PipelineTask to configure the error handling strategy.

This commit leverages the PipelineTask.OnError API field introduced in the previous PR, implements the error handling strategy, and updates the related docs and tests.

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Implement "Ignore Task Failure" with the new "PipelineTask.OnError" API field (TEP-0050). Users can now set `pipelineTask.onError: continue` to ignore a task failure.

@tekton-robot
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 27, 2023
@QuanZhang-William
Member Author

/test all

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/status.go 92.9% 93.0% 0.1
pkg/reconciler/pipelinerun/pipelinerun.go 92.3% 92.2% -0.1
pkg/reconciler/pipelinerun/resources/pipelinerunstate.go 99.5% 99.5% 0.0

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/status.go 92.9% 93.0% 0.1
pkg/reconciler/pipelinerun/pipelinerun.go 92.3% 92.2% -0.1
pkg/reconciler/pipelinerun/resources/pipelinerunstate.go 99.5% 99.5% 0.0

@QuanZhang-William
Member Author

/retest

@QuanZhang-William QuanZhang-William marked this pull request as ready for review November 27, 2023 18:25
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2023
@tekton-robot tekton-robot requested review from dibyom and jerop November 27, 2023 18:25

@JeromeJu
Member

/assign

steps:
- name: failing-step
image: busybox
script: exit 1; echo -n 123 | tee $(results.result1.path)'
Member

Looks like we are missing one quote ' here.
I'm not sure why this doesn't affect the test being run.
Curious how it got through CI, because when I create this PipelineRun locally it fails runtime validation.

Member Author

Missing quote added. Interestingly, I tried the same YAML file and test locally, and verified that the resources and test are created with no problem 🤔

test/ignore_task_error_test.go Outdated Show resolved Hide resolved
updateCompletedTaskRunStatus(logger, trs, pod)
onError, ok := tr.Annotations[v1.PipelineTaskOnErrorAnnotation]
if ok {
updateCompletedTaskRunStatus(logger, trs, pod, v1.PipelineTaskOnErrorType(onError))
Member

Could you help check my understanding here?
I see this func is called in the TaskRun reconciler, but what if the TaskRun fails at an earlier stage, e.g. PVC creation?

Member Author

Hi @JeromeJu. This is a good question.

TEP-0050 is designed with the requirement that the task would be considered "successful" for the purposes of determining the status of the pipeline (i.e. the TaskRun itself should still be failed, but with reason FailureIgnored). Looking at the use cases in TEP-0050, it is mainly used to ignore errors from the pod itself.

I think your question is that for the validation/reconciler errors (like the pvc creation), should we also mark the reason as FailureIgnored?

I think it might not be a good idea to mark validation/reconciler errors with the reason FailureIgnored:

  • Validation/reconciler errors likely come from the Pipeline/Task author. They shouldn't be ignored but raised and fixed, so a FailureIgnored reason could be confusing.
  • The FailureIgnored reason becomes too generic if we use it across the TaskRun reconciler code. Instead, I'd prefer to keep its scope to pod execution time.

WDYT?

Member

Thanks @QuanZhang-William for the offline conversation and completing the 2nd half of my question!

And yes, I do agree that in actual use cases, only the container (Step) failures in the pod, which indicate users' own test failures, should be ignored, and not the "preparation" of those steps, e.g. PVC creation and validations. Thanks for the explanation here.

So I think the current implementation is the way to go. One small ask: maybe we can add this explanation/conversation somewhere in the docstring, or link to TEP-0050 for this design choice.

Member

This makes sense as long as we document that FailureIgnored ignores a failure only when the actual task execution fails, and not otherwise (validation failure, etc.).
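
The scoping agreed on in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in `pkg/pod/status.go`: the `failureStage` type, the `failureReason` function, the `"Failed"` fallback reason, and the annotation key string are hypothetical stand-ins. Only the `FailureIgnored` reason and the `continue` value come from the discussion above.

```go
package main

import "fmt"

// onErrorAnnotation is a hypothetical stand-in for
// v1.PipelineTaskOnErrorAnnotation; the real key is not shown in this thread.
const onErrorAnnotation = "example.dev/pipeline-task-on-error"

// failureStage records where a TaskRun failed (hypothetical type).
type failureStage int

const (
	// stageValidation covers validation and reconciler setup, e.g. PVC creation.
	stageValidation failureStage = iota
	// stageStepExecution means a container (Step) in the pod failed.
	stageStepExecution
)

// failureReason returns the reason to record on a failed TaskRun.
// Only step execution failures are relabelled as FailureIgnored, and
// only when the onError annotation is set to "continue".
func failureReason(annotations map[string]string, stage failureStage) string {
	onError, ok := annotations[onErrorAnnotation]
	if ok && onError == "continue" && stage == stageStepExecution {
		return "FailureIgnored"
	}
	return "Failed"
}

func main() {
	ignore := map[string]string{onErrorAnnotation: "continue"}
	fmt.Println(failureReason(ignore, stageStepExecution))
	fmt.Println(failureReason(ignore, stageValidation))
	fmt.Println(failureReason(nil, stageStepExecution))
}
```

Run as written, the three calls print `FailureIgnored`, `Failed`, and `Failed`: a validation-stage failure keeps its ordinary reason even when the annotation asks to continue.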

Member Author

SG. I have added the docstring. PTAL

@QuanZhang-William
Member Author

cc @tektoncd/core-collaborators @tektoncd/core-maintainers @jerop @pritidesai . Please take a look if you are interested!

@dibyom
Member

dibyom commented Dec 1, 2023

Overall, this looks good to me! Thanks @QuanZhang-William Once you address the remaining open comments, I can approve.

@QuanZhang-William
Member Author

Overall, this looks good to me! Thanks @QuanZhang-William Once you address the remaining open comments, I can approve.

Thanks @dibyom. I've addressed the comments. PTAL!

Part of [tektoncd#7165][tektoncd#7165]. In [TEP-0050][tep-0050], we proposed adding an `OnError` API field under `PipelineTask` to configure the error handling strategy.

This commit leverages the `PipelineTask.OnError` API field introduced in the previous PR, implements the error handling strategy, and updates the related docs and tests.

/kind feature

[tep-0050]: https://github.com/tektoncd/community/blob/main/teps/0050-ignore-task-failures.md
[tektoncd#7165]: tektoncd#7165

Member

@JeromeJu JeromeJu left a comment

Thanks @QuanZhang-William
One nit on the release note: it might be helpful to also include how users can use the OnError feature, e.g. "users can now set OnError to 'continue' to ignore failure from previous task".

@@ -278,11 +280,11 @@ func (state PipelineRunState) getNextTasks(candidateTasks sets.String) []*Resolv
}

// IsStopping returns true if the PipelineRun won't be scheduling any new Task because
// at least one task already failed or was cancelled in the specified dag
// at least one task already failed (with onError: stopAndFail) or was cancelled in the specified dag
Member

nit: "stopAndFail" is the default here; this sounds like users have to set OnError?
Will leave this up to your decision.
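
To make the point about the default concrete, here is a minimal sketch of the stopping rule described in the updated comment. It assumes a simplified, hypothetical `taskState` struct and local `onErrorType` constants rather than the real `PipelineRunState` types; an unset `onError` is treated as `stopAndFail`, as noted above.

```go
package main

import "fmt"

// onErrorType is a local stand-in for the two values mentioned in this PR.
type onErrorType string

const (
	stopAndFail     onErrorType = "stopAndFail"
	continueOnError onErrorType = "continue"
)

// taskState is a simplified, hypothetical view of one PipelineTask's run.
type taskState struct {
	name      string
	onError   onErrorType
	failed    bool
	cancelled bool
}

// effective treats an unset onError as stopAndFail, the default.
func (t taskState) effective() onErrorType {
	if t.onError == "" {
		return stopAndFail
	}
	return t.onError
}

// isStopping reports whether no new tasks should be scheduled: some task
// was cancelled, or failed while its effective onError is stopAndFail.
func isStopping(tasks []taskState) bool {
	for _, t := range tasks {
		if t.cancelled {
			return true
		}
		if t.failed && t.effective() == stopAndFail {
			return true
		}
	}
	return false
}

func main() {
	tasks := []taskState{
		{name: "lint", onError: continueOnError, failed: true},
		{name: "build"},
	}
	fmt.Println(isStopping(tasks))
	// A failure in a task that never set onError falls back to stopAndFail.
	tasks[1].failed = true
	fmt.Println(isStopping(tasks))
}
```

The first call prints `false` because the only failure is marked `continue`; the second prints `true` because `build` fails without setting `onError` at all.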

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 4, 2023
Member

@dibyom dibyom left a comment

Any other feedback on this before merge? cc @vdemeester @jerop @pritidesai @afrittoli

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dibyom, JeromeJu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JeromeJu JeromeJu added this to the Pipelines v0.55 milestone Dec 6, 2023
@QuanZhang-William
Member Author

Any other feedback on this before merge? cc @vdemeester @jerop @pritidesai @afrittoli

Hi @dibyom. This PR has been open for more than a week. I was wondering if we can merge it? 🙏

Member

@Yongxuanzhang Yongxuanzhang left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2023
@tekton-robot tekton-robot merged commit b6d27a8 into tektoncd:main Dec 6, 2023
3 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.