Use a helper for setting the Succeeded condition on PipelineRun. #2749

Merged: 1 commit, Jun 4, 2020

Conversation

mattmoor
Member

@mattmoor mattmoor commented Jun 4, 2020

These helpers reduce a lot of the boilerplate and give us hooks
where we can eagerly set the CompletionTime field rather than waiting
for updateStatus.

Fixes: #2741

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 4, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 90.5% 77.6% -12.9
pkg/reconciler/pipelinerun/pipelinerun.go 80.5% 80.2% -0.2

@tekton-robot
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneous one.

@mattmoor
Member Author

mattmoor commented Jun 4, 2020

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 4, 2020
Member

@vdemeester vdemeester left a comment

Comment on lines +230 to +248
// MarkSucceeded changes the Succeeded condition to True with the provided reason and message.
func (pr *PipelineRunStatus) MarkSucceeded(reason, messageFormat string, messageA ...interface{}) {
	pipelineRunCondSet.Manage(pr).MarkTrueWithReason(apis.ConditionSucceeded, reason, messageFormat, messageA...)
	succeeded := pr.GetCondition(apis.ConditionSucceeded)
	pr.CompletionTime = &succeeded.LastTransitionTime.Inner
}

// MarkFailed changes the Succeeded condition to False with the provided reason and message.
func (pr *PipelineRunStatus) MarkFailed(reason, messageFormat string, messageA ...interface{}) {
	pipelineRunCondSet.Manage(pr).MarkFalse(apis.ConditionSucceeded, reason, messageFormat, messageA...)
	succeeded := pr.GetCondition(apis.ConditionSucceeded)
	pr.CompletionTime = &succeeded.LastTransitionTime.Inner
}

// MarkRunning changes the Succeeded condition to Unknown with the provided reason and message.
func (pr *PipelineRunStatus) MarkRunning(reason, messageFormat string, messageA ...interface{}) {
	pipelineRunCondSet.Manage(pr).MarkUnknown(apis.ConditionSucceeded, reason, messageFormat, messageA...)
}

Member

😻

@tekton-robot tekton-robot requested review from afrittoli and a user June 4, 2020 08:48
@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2020
Member

@afrittoli afrittoli left a comment

Thanks for this!

I was under the impression that in some cases we set the status to "ConditionFalse" even in case of transient errors, but I cannot find an example anymore - perhaps it was in the taskrun controller - so I think this is good!

If we find a case of transient error in the future we'll need to make sure we don't use the MarkFailed helper because we would not want to set the completion time in that case.

The only concern left that I have is that we have an issue with setting the completion time today. We want to mark the pipeline failed as early as possible, however taskruns may still be running when that happens, and we need to keep them running until they finish, which means that the completion time does not really reflect the time when the last sub-resource completed.

To fix that issue with this change in place, we won't be able to set the pipeline as failed until all TaskRuns have completed. Is this an acceptable approach?

@vdemeester @bobcatfish @pritidesai

/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2020
@tekton-robot tekton-robot merged commit 8ff3169 into tektoncd:master Jun 4, 2020
@mattmoor mattmoor deleted the completion-time branch June 4, 2020 15:04
@pritidesai
Member

pritidesai commented Jun 4, 2020

The only concern left that I have is that we have an issue with setting the completion time today. We want to mark the pipeline failed as early as possible, however taskruns may still be running when that happens, and we need to keep them running until they finish, which means that the completion time does not really reflect the time when the last sub-resource completed.

To fix that issue with this change in place, we won't be able to set the pipeline as failed until all TaskRuns have completed. Is this an acceptable approach?

Yes, it's a valid concern. I haven't looked into this yet (will look into it ASAP), but are we now setting the pipeline completion time when the first taskRun failure is noted? 🤔

	after := resources.GetPipelineConditionStatus(pr, pipelineState, c.Logger, d)
	pr.Status.SetCondition(after)
	switch after.Status {
	case corev1.ConditionTrue:
		pr.Status.MarkSucceeded(after.Reason, after.Message)
	case corev1.ConditionFalse:
		pr.Status.MarkFailed(after.Reason, after.Message)

In that case, we need a separate helper that does not set the completion time on failure. The issue is that GetPipelineConditionStatus returns corev1.ConditionFalse as soon as a failure is discovered, without checking whether any other taskRuns are still running. Ideally, GetPipelineConditionStatus should wait before returning failure, but at the same time the reconciler should not pick up any new tasks during this waiting phase.

A    B     C
|    |     |
X    Y     Z

Let's say we have three tasks executing in parallel: A, B, and C. Task A fails while Tasks B and C are still running. Task B then finishes successfully, but Task C is still running. While we wait for Task C to finish, the reconciler should not schedule Task Y (the node that follows Task B); otherwise we might break backward compatibility of failing the pipeline on the first task failure.

Trying to address this particular scenario in issue #1680
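
The A/B/C scenario above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `task` type, the `state` values, and the `reconcile` function are hypothetical stand-ins, not the Tekton reconciler or `GetPipelineConditionStatus`. It shows one way to stop scheduling once a failure is seen while holding back the "Failed" verdict until running tasks finish.

```go
package main

import "fmt"

type state int

const (
	pending state = iota
	running
	succeeded
	failed
)

// task is a hypothetical stand-in for a node in the pipeline DAG.
type task struct {
	name   string
	state  state
	parent string // task this one runs after; "" for root tasks
}

// reconcile returns the tasks to schedule next and an overall verdict.
// After the first failure nothing new is scheduled, and the verdict stays
// "Running" until every task that was already running has finished.
func reconcile(tasks []task) (toSchedule []string, verdict string) {
	byName := make(map[string]task, len(tasks))
	anyFailed, anyRunning, allSucceeded := false, false, true
	for _, t := range tasks {
		byName[t.name] = t
		switch t.state {
		case failed:
			anyFailed = true
		case running:
			anyRunning = true
		}
		if t.state != succeeded {
			allSucceeded = false
		}
	}
	if anyFailed {
		if anyRunning {
			return nil, "Running" // wait for in-flight tasks, schedule nothing
		}
		return nil, "Failed"
	}
	if allSucceeded {
		return nil, "Succeeded"
	}
	for _, t := range tasks {
		if t.state != pending {
			continue
		}
		if t.parent == "" || byName[t.parent].state == succeeded {
			toSchedule = append(toSchedule, t.name)
		}
	}
	return toSchedule, "Running"
}

func main() {
	// A failed, B succeeded, C still running; X, Y, Z follow A, B, C.
	tasks := []task{
		{"A", failed, ""}, {"B", succeeded, ""}, {"C", running, ""},
		{"X", pending, "A"}, {"Y", pending, "B"}, {"Z", pending, "C"},
	}
	next, verdict := reconcile(tasks)
	fmt.Println(len(next), verdict)
}
```

In this sketch Task Y is never scheduled even though Task B succeeded, which matches the backward-compatibility constraint described above.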

@afrittoli
Member

TBH I'm more inclined now towards changing the logic to setting both status and completion time of the PipelineRun only once all subresources have finished their job.
@pritidesai @vdemeester @bobcatfish @imjasonh wdyt?

@pritidesai
Member

pritidesai commented Jun 5, 2020

I tested master with a pipeline having three tasks:

A(failure)    B(successful with sleep 10)     C(successful with sleep 20)

The PipelineRun completion time is set to the task A completion time while task B and task C are still running 😢:

kubectl get pr -o json | jq '.items[].status.completionTime'
"2020-06-05T19:52:03Z"

kubectl get pr -o json | jq '[ .items[].status.taskRuns[] | (.pipelineTaskName + " " + .status.completionTime) ]'
[
  "task-a 2020-06-05T19:52:03Z",
  "task-b 2020-06-05T19:52:16Z",
  "task-c 2020-06-05T19:52:27Z"
]

The tkn CLI reports the pipeline duration as the duration of the failed taskRun:

tkn pr describe pipelinerun-one-failure-two-success
Name:           pipelinerun-one-failure-two-success
Namespace:      default
Pipeline Ref:   pipeline-one-failure-two-success
Timeout:        1h0m0s
Labels:
 tekton.dev/pipeline=pipeline-one-failure-two-success

🌡️  Status

STARTED          DURATION     STATUS
18 minutes ago   10 seconds   Failed

💌 Message

TaskRun pipelinerun-one-failure-two-success-task-a-sgwh2 has failed ("step-fail" exited with code 1 (image: "docker-pullable://ubuntu@sha256:747d2dbbaaee995098c9792d99bd333c6783ce56150d1b11e333bbceed5c54d7"); for logs run: kubectl -n default logs pipelinerun-one-failure-two-success-task-a-sgwh2-pod-2xldc -c step-fail
)

📦 Resources

 No resources

⚓ Params

 No params

🗂  Taskruns

 NAME                                                 TASK NAME   STARTED          DURATION     STATUS
 ∙ pipelinerun-one-failure-two-success-task-a-sgwh2   task-a      18 minutes ago   10 seconds   Failed
 ∙ pipelinerun-one-failure-two-success-task-b-82xxr   task-b      18 minutes ago   23 seconds   Succeeded
 ∙ pipelinerun-one-failure-two-success-task-c-ztv7k   task-c      18 minutes ago   34 seconds   Succeeded

@afrittoli
Member

Yeah, the completion time is now set along with setting success or failure, which is a change in behaviour. I think the next step on this should be to do so only once all resources spawned by the pipelinerun have completed their work.

@bobcatfish @vdemeester @pritidesai @mattmoor @imjasonh thoughts?
I would be happy to submit a PR for that if we agree on it.
Personally I think it's ok to have v0.13.0 behaving like this.
Alternatively we could roll this back in a v0.13.1 branch (but not on master?) and continue towards the genreconciler for v0.14

@mattmoor
Member Author

mattmoor commented Jun 6, 2020

I'm a bit puzzled by why my change affected this unless the Succeeded condition is being set to failed and then unset before the updateStatus? Let me take another look, but a workaround would be to have the MarkUnknown variant of this clear CompletionTime 🤔

@mattmoor
Member Author

mattmoor commented Jun 6, 2020

e.g. are we expecting this line to undo a MarkFailed earlier in the reconciliation?

pr.Status.MarkRunning(after.Reason, after.Message)

@afrittoli
Member

TBH I would prefer that we stop setting the status to failed with no completion time, and instead wait until the DAG is completed and then report how many tasks failed / passed / skipped / were cancelled.
I started working on a patch for that, and I'll raise this question on our Monday API WG.
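
As a rough illustration of that idea, the sketch below reports a summary only once nothing is running. The `outcome` type, its values, the `summarize` function, and the message format are all hypothetical and chosen for this example; they are not taken from the patch mentioned above.

```go
package main

import "fmt"

// outcome is a hypothetical per-task result used only in this sketch.
type outcome string

const (
	passed    outcome = "passed"
	failed    outcome = "failed"
	skipped   outcome = "skipped"
	cancelled outcome = "cancelled"
	running   outcome = "running"
)

// summarize reports done=false while any task is still running. Once every
// task has reached a final outcome it returns done=true and a count of each
// outcome, so the status is reported a single time, at the end of the DAG.
func summarize(outcomes []outcome) (done bool, message string) {
	counts := map[outcome]int{}
	for _, o := range outcomes {
		if o == running {
			return false, ""
		}
		counts[o]++
	}
	return true, fmt.Sprintf("failed: %d, passed: %d, skipped: %d, cancelled: %d",
		counts[failed], counts[passed], counts[skipped], counts[cancelled])
}

func main() {
	done, msg := summarize([]outcome{failed, passed, passed})
	fmt.Println(done, msg)
}
```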

@afrittoli
Member

#2774

afrittoli added a commit to afrittoli/pipeline that referenced this pull request Jun 8, 2020
…un."

PR tektoncd#2749 introduces helpers to set the completion time along with
setting the Succeeded condition to Unknown, True or False.

This is fine, however in combination with a previous issue, whereby
we update the Succeeded condition to False in case of failure as soon
as the first failure is encountered, this results in having the
completion time set as soon as the first failure is encountered,
which may not match the actual completion time of the pipeline run,
in case other tasks were already running when the initial failure
occurred.

For v0.13.x we shall keep completion time and condition update
separated. Next release will include this plus a fix to the
original issue.

This reverts commit 8ff3169.
tekton-robot pushed a commit that referenced this pull request Jun 8, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

CompletionTime is not set if Succeeded is already True
5 participants