Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set pipeline status when all tasks complete #2774

Merged
merged 1 commit into from
Jun 11, 2020

Conversation

afrittoli
Copy link
Member

@afrittoli afrittoli commented Jun 7, 2020

Changes

We used to set the pipeline status to failed as soon as the first
task in the pipeline failed or was cancelled.

As soon as the first task in the pipeline fails or is cancelled, we
stop scheduling new tasks, as we did before, but we will report
status Unknown until all Tasks are complete, with reason "stopping".

This allows to:

  • the completion time at the same time that the status is set
    and avoid inconsistencies
  • wait until all tasks are complete before we cleanup the pipeline
    artifact storage, affinity assistant and record metrics
  • report the correct number of failed / cancelled tasks, as there
    may be more than one. Other tasks that were already running
    when the first failure happened may fail too
  • prepare the pipeline controller more complex workflows, where
    the controller may continue working scheduling after failures

Add test coverage for isSkipped and extend the pipelineresolution
module unit test coverage to capture more input pipeline state.

The DAG / GetNextTasks functions have not been touched at all.
The pipeline run reconciler, when the pipeline run is in stopping
mode, stops asking for next tasks to run. Once all running tasks
finish, the pipeline run finally gets in failed state.

When the pipeline run is in stopping mode, tasks that have not
been started yet are also counted as skipped in the status message
reported to the user.

Fixes #1680

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

In case of task run failure or cancellation, the pipeline run stops scheduling new tasks like before.
The status is marked as failed however only once all the task runs already scheduled are complete.

A new reason "PipelineRunStopping" indicate that the pipeline run has found a failure and is waiting for task runs to complete.

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2020
@tekton-robot tekton-robot requested review from bobcatfish and a user June 7, 2020 11:29
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 7, 2020
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

2 similar comments
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@afrittoli afrittoli force-pushed the pipeline_fail_later branch from f4513a1 to e72a32e Compare June 7, 2020 21:14
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 89.3% -1.2

@afrittoli afrittoli force-pushed the pipeline_fail_later branch from e72a32e to e7ea796 Compare June 8, 2020 05:00
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 89.3% -1.2

@afrittoli
Copy link
Member Author

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 8, 2020
@afrittoli afrittoli force-pushed the pipeline_fail_later branch from e7ea796 to 25ca466 Compare June 8, 2020 10:48
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 89.3% -1.2

@afrittoli afrittoli force-pushed the pipeline_fail_later branch from 25ca466 to 16d7973 Compare June 8, 2020 15:33
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 90.1% -0.3

@afrittoli afrittoli force-pushed the pipeline_fail_later branch from 16d7973 to f57d553 Compare June 8, 2020 15:47
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 81.7% 81.5% -0.2
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 90.1% -0.3

@afrittoli afrittoli changed the title WIP Set pipeline status when all tasks complete Set pipeline status when all tasks complete Jun 8, 2020
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2020
@bobcatfish bobcatfish added this to the Pipelines 0.13.1 🐱 milestone Jun 8, 2020
@afrittoli afrittoli changed the title Set pipeline status when all tasks complete WIP Set pipeline status when all tasks complete Jun 8, 2020
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2020
@afrittoli afrittoli force-pushed the pipeline_fail_later branch from f57d553 to 4e90952 Compare June 8, 2020 20:39
@tekton-robot tekton-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 8, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 81.7% 81.5% -0.2
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 91.8% 1.3

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2020
@@ -231,6 +231,9 @@ const (
PipelineRunReasonCancelled PipelineRunReason = "Cancelled"
// PipelineRunReasonTimedOut is the reason set when the PipelineRun has timed out
PipelineRunReasonTimedOut PipelineRunReason = "PipelineRunTimeout"
// ReasonStopping indicates that no new Tasks will be scheduled by the controller, and the
// pipeline will stop once all running tasks complete their work
PipelineRunReasonStopping PipelineRunReason = "PipelineRunStopping"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks little odd, I ran a simple pipeline with these changes and the state transitions from stop to failed, pipeline stopped while task is running 😕

kubectl get pr
NAME                                  SUCCEEDED   REASON                STARTTIME   COMPLETIONTIME
pipelinerun-one-failure-two-success   Unknown     PipelineRunStopping   21s

kubectl get tr
NAME                                               SUCCEEDED   REASON      STARTTIME   COMPLETIONTIME
pipelinerun-one-failure-two-success-task-a-qdphk   False       Failed      25s         14s
pipelinerun-one-failure-two-success-task-b-z8ctx   True        Succeeded   25s         7s
pipelinerun-one-failure-two-success-task-c-pwkhv   Unknown     Running     25s

kubectl get pr
NAME                                  SUCCEEDED   REASON   STARTTIME   COMPLETIONTIME
pipelinerun-one-failure-two-success   False       Failed   31s         5s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying it out on a real cloud - I kind of relied on unit and E2E tests :)

The reason changes as follows:

  • start / running / stopping / failed
    which is what I was hoping to achieve.
    When we extend the pipeline with finally, it could look something like:
  • start / running / stopping / running-finally / failed

The message could also help providing more details.

@afrittoli afrittoli force-pushed the pipeline_fail_later branch from 96e646f to f79dfc9 Compare June 10, 2020 17:42
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 91.8% 1.3

@afrittoli afrittoli force-pushed the pipeline_fail_later branch from f79dfc9 to 6712c13 Compare June 10, 2020 17:49
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 91.8% 1.3

return false
}

c := t.TaskRun.Status.GetCondition(apis.ConditionSucceeded)
Copy link
Member

@pritidesai pritidesai Jun 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task with a condition should be covered here and should help mark pipeline as started when a condition container has started and condition is executing.

@pritidesai
Copy link
Member

this looks great, thanks @afrittoli
I have some work to do now 😢 on finally PR, it wouldnt make sense to have pipeline as:
started -> running -> stopping -> running -> failed may be / may be not 🤔 , will verify ✍️
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2020
We used to set the pipeline status to failed as soon as the first
task in the pipeline failed or was cancelled.

As soon as the first task in the pipeline fails or is cancelled, we
stop scheduling new tasks, as we did before, but we will report
status Unknown until all Tasks are complete, with reason "stopping".

This allows to:
- the completion time at the same time that the status is set
  and avoid inconsistencies
- wait until all tasks are complete before we cleanup the pipeline
  artifact storage, affinity assistant and record metrics
- report the correct number of failed / cancelled tasks, as there
  may be more than one. Other tasks that were already running
  when the first failure happened may fail too
- prepare the pipeline controller more complex workflows, where
  the controller may continue working scheduling after failures

Add test coverage for isSkipped and extend the pipelineresolution
module unit test coverage to capture more input pipeline state.

The DAG / GetNextTasks functions have not been touched at all.
The pipeline run reconciler, when the pipeline run is in stopping
mode, stops asking for next tasks to run. Once all running tasks
finish, the pipeline run finally gets in failed state.

When the pipeline run is in stopping mode, tasks that have not
been started yet are also counted as skipped in the status message
reported to the user.
@afrittoli afrittoli force-pushed the pipeline_fail_later branch from 6712c13 to 4318bec Compare June 10, 2020 18:13
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 82.1% 82.0% -0.1
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 90.5% 91.8% 1.3


// runNextSchedulableTask gets the next schedulable Tasks from the dag based on the current
// pipeline run state, and starts them
func (c *Reconciler) runNextSchedulableTask(ctx context.Context, pr *v1beta1.PipelineRun, d *dag.Graph, pipelineState resources.PipelineRunState, as artifacts.ArtifactStorageInterface) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: how to avoid any more reconciler receiver functions and at the same time not explode the reconcile itself? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!
Getting the recorder and the logger from the context helps with that.
This function depends on other receiver functions, so it still needs c, but perhaps it is possible now to reduce the number of receiver functions - in a separate PR

@afrittoli
Copy link
Member Author

this looks great, thanks @afrittoli
I have some work to do now 😢 on finally PR, it wouldnt make sense to have pipeline as:
started -> running -> stopping -> running -> failed may be / may be not 🤔 , will verify ✍️
/lgtm

I think it could be like that as a start, it would work fine still. We could improve on the reason as a PR on top. I would be happy to help with that.

@pritidesai
Copy link
Member

/lgtm

This PR will impact finally PR #2661, I am fine merging this before we merge finally PR, will incorporate necessary changes into that PR.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 11, 2020
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/meow

@tekton-robot
Copy link
Collaborator

@vdemeester: cat image

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2020
@tekton-robot tekton-robot merged commit 3b95494 into tektoncd:master Jun 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(*PipelineRun).IsDone() is incorrect
6 participants