Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(*PipelineRun).IsDone() is incorrect #1680

Closed
gpaul opened this issue Dec 4, 2019 · 21 comments · Fixed by #2774
Closed

(*PipelineRun).IsDone() is incorrect #1680

gpaul opened this issue Dec 4, 2019 · 21 comments · Fixed by #2774
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/question Issues or PRs that are questions around the project or a particular feature

Comments

@gpaul
Copy link
Contributor

gpaul commented Dec 4, 2019

The (*PipelineRun).IsDone() function returns true when the Success condition is no longer Unknown. However, as soon as the first task in the PipelineRun fails it sets the PipelineRun's Success condition to False, and IsDone() immediately returns true, despite several other tasks still Running.

Instead, IsDone() should check Status.CompletionTime or iterate through all tasks and check whether all tasks have finished executing.

Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.

I'm happy to submit a PR if this sounds sensible.

@vdemeester
Copy link
Member

@gpaul thanks for the issue, I think this is by design "initially" — aka, we mark the PipelineRun as failed as soon as one TaskRun failed, even if other are running, as we know the PipelineRun is gonna fail anyway.

That said, it might not be the best behaviour, and even more in relation to #1394, #1376 and #1023 👼

/cc @bobcatfish @imjasonh
/kind question
/area api

@tekton-robot tekton-robot added kind/question Issues or PRs that are questions around the project or a particular feature area/api Indicates an issue or PR that deals with the API. labels Dec 4, 2019
@gpaul
Copy link
Contributor Author

gpaul commented Dec 4, 2019

@gpaul thanks for the issue, I think this is by design "initially" — aka, we mark the PipelineRun as failed as soon as one TaskRun failed, even if other are running, as we know the PipelineRun is gonna fail anyway.

Yes, that makes total sense. The issue is really that we treat this state as IsDone() == true, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)

@vdemeester
Copy link
Member

Yes, that makes total sense. The issue is really that we treat this state as IsDone() == true, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)

That's fair 😝.

@dibyom
Copy link
Member

dibyom commented Dec 11, 2019

Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.

Sounds reasonable!

I'm happy to submit a PR if this sounds sensible.

That'd be awesome! Thanks 🙏

(Also remove area/api label since this only technically affects a function in the reconciler)

@dibyom dibyom removed the area/api Indicates an issue or PR that deals with the API. label Dec 11, 2019
@gpaul
Copy link
Contributor Author

gpaul commented Dec 16, 2019

I've had a look and it's actually rather tricky.

Two strategies present themselves:

  1. The first strategy is to use the presence of PipelineRun.CompletionTime to determine whether or not a PipelineRun is done:
func (pr *PipelineRun) IsDone() bool {
    return pr.Status.CompletionTime != nil && !pr.Status.CompletionTime.IsZero()
}

This is very simple and assumes that a PipelineRun is done if and only if it has a CompletionTime set.

This sounds very reasonable, but it is not currently true.

For example, this line sets the PipelineRun's CompletionTime when a Success/Failed condition is set, which happens when the first taskrun fails:
https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L627-L631

  1. The second alternative is to use PipelineRun.Status.TaskRuns and check whether all task runs have a status of Success or Failure.
func (pr *PipelineRun) IsDone() bool {
	for _, trs := range pr.Status.TaskRuns {
		if trs.Status.GetCondition(apis.ConditionSucceeded).IsUnknown() {
			return false
		}
	}
	return true
}

However, this naive approach means that until a PipelineRun that does not yet have .Status.TaskRuns populated with all the tasks that need to execute will be considered done. For example, a PipelineRun that was created a moment ago and has not yet had its Status updated at all will report IsDone() == true. That's also pretty bad.

It looks like the only place where we really know what state a PipelineRun is in, is in the reconciler code (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446). Here we seem to have intimate knowledge of which tasks should have been executed, which have done so successfully, which failed, and which are still pending.

I propose we introduce a new invariant: PipelineRun.CompletionTime is set if and only if all TaskRuns that comprise that PipelineRun and will execute as part of that PipelineRun have completed.

A special case is where the PipelineRun has timed out/canceled. At that point, its TaskRuns are already timed out/canceled, too.

We then update the reconciliation logic here (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446) to set the CompletionTime if and only if all non-skipped TaskRuns have succeeded or failed (whether due to regular failure, cancellation, or timeout.)

I've prepared a WIP PR (some outstanding tests) here: #1749

I'm not sure how you'd like to to proceed. Using CompletionTime to carry a binary "Completed" state feels a little strange to me, I would prefer to set the ConditionSucceeded=False condition on a PipelineRun once all tasks have completed, timed out, or been cancelled. And if someone is interested in whether any of its individual tasks have failed, they can query such a condition on the pipelinerun or loop through its Status.TaskRuns.

@afrittoli
Copy link
Member

afrittoli commented Mar 11, 2020

@gpaul thanks for the issue, I think this is by design "initially" — aka, we mark the PipelineRun as failed as soon as one TaskRun failed, even if other are running, as we know the PipelineRun is gonna fail anyway.

Yes, that makes total sense. The issue is really that we treat this state as IsDone() == true, which is not the case (we are merely saying that the pipelinerun has failed, not that it has completed.)

I agree that the semantic of the name isDone is not entirely correct. The function is what the controller needs as an early check to stop processing the resource, so it's "done" from a controller POV.
What are the use cases for which we need an isDone function that takes in account if all tasks are completed?

I'm not sure if this has an impact on metrics @hrishin . If the duration reported for a pipeline is that until isDone is reached, it would be a misleading. While the time spent by the pipelinerun controller on the pipeline would be correct, it would probably not match the expectation, which would probably be the time between the creating in etcd until the time the last taskrun completed.

@afrittoli
Copy link
Member

afrittoli commented Mar 13, 2020

Looking at the code I see a few issues with the current implementation:

  • the timeout handler is released, which means that the pipeline timeout will not be applied to tasks that might be still running. They will still get the task timeout, but that may be not set
  • the metric reported for the duration of the pipeline is invalid

@gpaul your considerations about sounds very reasonable to me. An alternative could be to automatically cancel all running tasks the moment the first task fails. Even if we did that, it might take a bit of time for all the tasks to end, as they may have cloudEvent to send and - soon enough - "finally" type of steps to be executed.

@gpaul
Copy link
Contributor Author

gpaul commented Mar 13, 2020

What are the use cases for which we need an isDone function that takes in account if all tasks are completed?

I have a two real use cases where I need to wait for the entire pipeline to finish:

  1. I have a process that waits for a pipeline to finish, then archives the pod logs of all the taskruns that ran. That process needs to know when it should run and when it can stop watching a given pipelinerun.

  2. I have a process that sets CI task statuses on GitHub PRs, one for each Task in the pipeline. If one of two TaskRuns that are executing concurrently fails, the other will continue to run. However, if my process uses pipelinerun.IsDone() to determine whether or not it can stop watching the given pipelinerun, then it needs to immediately mark the pending TaskRun as Succeeded/Failed/Canceled despite the fact that the TaskRun will continue running and succeed/fail at a later point. This is awkward as a user clicking through to the tekton UI will see tasks succeeding / failing that are shown as neither on the PR.

I currently work around this by waiting for all taskruns to finish if the pipelinerun has failed, or for all expected tasks to have run and completed if the pipelinerun has not failed. In general knowing what tasks are expected to execute is tricky (e.g., Condition resources) and the logic necessary to calculate whether a pipelinerun has finally completed requires the implementer to have intimate knowledge of the inner workings of tekton, in fact I ended up copy/pasting a bunch of code from its guts.

I'm happy to capture the new semantics in a new function. Something like "IsCompleted", perhaps.

In general, the pattern of watching a pipelinerun for updates and waiting until all tasks have completed is a very convenient place to hook into its lifecycle. I would prefer a solution where a pipelinerun transitions through a sequence of lifecycle events and any controller watching it can take appropriate action by querying its current state.

@gpaul
Copy link
Contributor Author

gpaul commented Mar 13, 2020

Looking at the code I see a few issues with the current implementation:

Thanks for looking at the code. I think a more in-depth design is called for to expose pipelinerun lifecycle hooks. I think my PR is too narrow.

@pritidesai
Copy link
Member

@gpaul, In pipelinerun.go there is an additional check pipelineState.IsDone() which checks whether any of the tasks still has taskRun associated with it and returns false if there is any (reconciler does check pipelineState for any remaining tasks):

https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L409

Screen Shot 2020-03-16 at 6 16 41 PM

Hope this helps 😉

@pritidesai
Copy link
Member

pritidesai commented Mar 18, 2020

Ran a small test with the following PipelineRun:

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: failure-task
spec:
  steps:
    - name: fail
      image: ubuntu
      script: exit 1
---

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: successful-task
spec:
  steps:
    - name: success
      image: ubuntu
      script: |
        sleep 10;
        exit 0
---

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: 01-pipeline-failure
spec:
  tasks:
    - name: task1
      taskRef:
        Name: failure-task
    - name: task2
      taskRef:
        Name: successful-task
---

apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: 01-pipelinerun-failure
spec:
  pipelineRef:
    name: 01-pipeline-failure

Factored in task2 completion time even when task1 failed before task2:

tkn pr describe 01-pipelinerun-failure
Name:           01-pipelinerun-failure
Namespace:      default
Pipeline Ref:   01-pipeline-failure

🌡️  Status

STARTED          DURATION     STATUS
37 seconds ago   19 seconds   Failed

💌 Message

TaskRun 01-pipelinerun-failure-task1-kct6t has failed ("step-fail" exited with code 1 (image: "docker-pullable://ubuntu@sha256:04d48df82c938587820d7b6006f5071dbbffceb7ca01d2814f81857c631d44df"); for logs run: kubectl -n default logs 01-pipelinerun-failure-task1-kct6t-pod-knztl -c step-fail)

📦 Resources

 No resources

⚓ Params

 No params

🗂  Taskruns

 NAME                                         TASK NAME   STARTED          DURATION     STATUS
 ∙ 01-pipelinerun-failure-task1-kct6t   task1       37 seconds ago   12 seconds   Failed
 ∙ 01-pipelinerun-failure-task2-x9gbv   task2       37 seconds ago   19 seconds   Succeeded

@gpaul
Copy link
Contributor Author

gpaul commented Mar 18, 2020

Good to know, thanks. Note that the problem is not that the completion time is incorrect, but that pr.IsDone returns true before completionTime is set.

@pritidesai
Copy link
Member

pritidesai commented Mar 18, 2020

🤔 for the above sample run, PipelineRun has

CompletionTime: (*v1.Time)(0xc000b83000)(2020-03-18 16:33:50 +0000 UTC),

here

which matches with task2 completion time here

CompletionTime: (*v1.Time)(0xc000b831e0)(2020-03-18 16:33:50 +0000 UTC),

instead of task1 completion time here

CompletionTime: (*v1.Time)(0xc000b83080)(2020-03-18 16:33:43 +0000 UTC),

@gpaul
Copy link
Contributor Author

gpaul commented Mar 18, 2020

OK - so you're saying that the CompletionTime is set the value of the last taskrun that completed?

@pritidesai
Copy link
Member

OK - so you're saying that the CompletionTime is set the value of the last taskrun that completed?

yup, PipelineRun completion time is set to the completion time of a TaskRun which was completed last. When a task fails while other tasks are still running, PipelineRun waits for those tasks to finish and the longest task's completion time out of those, is picked up as PipelineRun completion time.

@pritidesai
Copy link
Member

pritidesai commented Mar 19, 2020

pr.IsDone() does return true after PipelineRun.Status.Status.Conditions[*].Status is no longer Unknown i.e. set to False on task failure, but pipeline completion time is set to the longest of the CompletionTime among PipelineRun.Status.PipelineRunStatusFields[*].PipelineRunTaskRunStatus[*].CompletionTime().

PipelineRun completion time keeps on updating and in the end is set to last task completion time.

You might be retrieving completion time as soon as pr.IsDone() returns true, but reconciler does not return on pr.IsDone (checks pipelineState as well before returning).

I am trying to think what can we do here without breaking existing callers of IsDone 🤔

@afrittoli
Copy link
Member

The pipelinerun reconciler defines a Reconcile method, which invokes isDone().

When isDone() returns true, there are a few consequences:

  • the pipeline artifact storage is cleaned up, which may break tasks that are still running
  • the timeout handler is released for the pipelinerun
  • metrics are recorded
  • the reconcile method is not invoked anymore on the pipeline, which is where the extra check on pipelinestate is executed

As far as I can tell, the reason for the completion time being correct on the pipelinerun is that the taskrun controller updates the parent pipelinerun when it changes status.

@pritidesai
Copy link
Member

pritidesai commented Mar 20, 2020

As far as I can tell, the reason for the completion time being correct on the pipelinerun is that the taskrun controller updates the parent pipelinerun when it changes status.

As soon as the first failure is hit, pipelineRun status is updated to Status: (v1.ConditionStatus) (len=7) "False" which is not equal to original status as original still has Status: (v1.ConditionStatus) (len=7) "Unknown", this results in a call to updateStatus.

The competition time on the pipelinerun is updated by the reconciler updateStatus method when pipelineRun Status is either False (matches here in case of first failure) or True using

pr.Status.CompletionTime = &metav1.Time{Time: time.Now()} 

@bobcatfish bobcatfish added the kind/bug Categorizes issue or PR as related to a bug. label Mar 23, 2020
@pritidesai
Copy link
Member

pritidesai commented May 29, 2020

/assign pritidesai

I need to get this fix while working on finally, finally tasks are executed after one of the pipelinetasks fail, pipeline status should not be updated until finally tasks have finished executing.

finally is getting implemented with PR #2661

@gpaul
Copy link
Contributor Author

gpaul commented Jun 11, 2020

Lovely!

@pritidesai
Copy link
Member

indeed, all thanks to @afrittoli 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/question Issues or PRs that are questions around the project or a particular feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants