(*PipelineRun).IsDone() is incorrect #1680
Comments
@gpaul thanks for the issue, I think this is by design "initially", aka, we mark the PipelineRun as failed as soon as one TaskRun has failed, even if others are still running, as we know the PipelineRun is gonna fail anyway. That said, it might not be the best behaviour, and even more in relation to #1394, #1376 and #1023 👼 /cc @bobcatfish @imjasonh |
Yes, that makes total sense. The issue is really that we treat this state as |
That's fair 😝. |
Sounds reasonable!
That'd be awesome! Thanks 🙏 (Also remove area/api label since this only technically affects a function in the reconciler) |
I've had a look and it's actually rather tricky. Two strategies present themselves:
This is very simple and assumes that a PipelineRun is done if and only if it has a CompletionTime set. This sounds very reasonable, but it is not currently true. For example, this line sets the PipelineRun's CompletionTime when a Success/Failed condition is set, which happens when the first taskrun fails:
However, this naive approach means that until a PipelineRun that does not yet have

It looks like the only place where we really know what state a PipelineRun is in, is in the reconciler code (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446). Here we seem to have intimate knowledge of which tasks should have been executed, which have done so successfully, which failed, and which are still pending.

I propose we introduce a new invariant:

A special case is where the PipelineRun has timed out or been canceled. At that point, its TaskRuns are already timed out or canceled, too.

We then update the reconciliation logic here (https://github.com/tektoncd/pipeline/blob/master/pkg/reconciler/pipelinerun/pipelinerun.go#L445-L446) to set the CompletionTime if and only if all non-skipped TaskRuns have succeeded or failed (whether due to regular failure, cancellation, or timeout).

I've prepared a WIP PR (some outstanding tests) here: #1749

I'm not sure how you'd like to proceed. Using |
I agree that the semantic of the name I'm not sure if this has an impact on metrics @hrishin . If the duration reported for a pipeline is that until |
Looking at the code I see a few issues with the current implementation:
@gpaul your considerations about sounds very reasonable to me. An alternative could be to automatically cancel all running tasks the moment the first task fails. Even if we did that, it might take a bit of time for all the tasks to end, as they may have cloudEvent to send and - soon enough - "finally" type of steps to be executed. |
I have two real use cases where I need to wait for the entire pipeline to finish:
I currently work around this by waiting for all taskruns to finish if the pipelinerun has failed, or for all expected tasks to have run and completed if the pipelinerun has not failed. In general, knowing which tasks are expected to execute is tricky (e.g., Condition resources), and the logic necessary to calculate whether a pipelinerun has finally completed requires the implementer to have intimate knowledge of the inner workings of Tekton. In fact, I ended up copy/pasting a bunch of code from its guts. I'm happy to capture the new semantics in a new function. Something like "IsCompleted", perhaps. In general, the pattern of watching a pipelinerun for updates and waiting until all tasks have completed is a very convenient place to hook into its lifecycle. I would prefer a solution where a pipelinerun transitions through a sequence of lifecycle events and any controller watching it can take appropriate action by querying its current state. |
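The workaround described above could look roughly like the following Go sketch. It is not the code that was copied from Tekton: `TaskRun` and `finallyCompleted` are hypothetical names, and the list of expected task names is assumed to be computed elsewhere (which, as noted, is the hard part).

```go
package main

// TaskRun is a hypothetical, simplified record of a started TaskRun.
type TaskRun struct {
	TaskName string
	Finished bool // true once the TaskRun has succeeded, failed, timed out or been canceled
}

// finallyCompleted reports whether a PipelineRun has finished executing.
//   - If the run has failed, it is complete once every started TaskRun has finished.
//   - Otherwise, it is complete once every expected task has run and finished.
func finallyCompleted(runFailed bool, started []TaskRun, expected []string) bool {
	finished := make(map[string]bool, len(started))
	for _, tr := range started {
		if !tr.Finished {
			return false // something is still running, so not complete in either case
		}
		finished[tr.TaskName] = true
	}
	if runFailed {
		return true // no further tasks will be started after a failure
	}
	for _, name := range expected {
		if !finished[name] {
			return false // an expected task has not run yet
		}
	}
	return true
}
```

A controller watching PipelineRun updates would call a check like this on each update and act only when it returns true.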
Thanks for looking at the code. I think a more in-depth design is called for to expose pipelinerun lifecycle hooks. I think my PR is too narrow. |
@gpaul, In Hope this helps 😉 |
Ran a small test with the following
Factored in
|
Good to know, thanks. Note that the problem is not that the completion time is incorrect, but that pr.IsDone returns true before completionTime is set. |
🤔 for the above sample run, PipelineRun has
which matches with
instead of
|
OK - so you're saying that the CompletionTime is set to the value of the last taskrun that completed? |
yup, PipelineRun completion time is set to the completion time of a TaskRun which was completed last. When a task fails while other tasks are still running, |
You might be retrieving completion time as soon as I am trying to think what we can do here without breaking existing callers of |
The When
As far as I can tell, the reason for the completion time being correct on the |
As soon as the first failure is hit, The completion time on the
|
/assign pritidesai I need to get this fix while working on
|
Lovely! |
indeed, all thanks to @afrittoli 🙏 |
The (*PipelineRun).IsDone() function returns true when the Success condition is no longer Unknown. However, as soon as the first task in the PipelineRun fails it sets the PipelineRun's Success condition to False, and IsDone() immediately returns true, despite several other tasks still Running.

Instead, IsDone() should check Status.CompletionTime or iterate through all tasks and check whether all tasks have finished executing.

Two new HasSucceeded() and HasFailed() functions can take the place of IsDone() if needed.

I'm happy to submit a PR if this sounds sensible.