Stop executing retried Tasks past their timeouts #4368
Conversation
/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This problem is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry; however, it would be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <leebernick@google.com>
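To make the approach concrete, here is a minimal sketch of the idea (not the exact code in this PR; the helper names `getEffectiveStartTime` and `hasTimedOut` are illustrative, while `Status.StartTime` and `Status.RetriesStatus` are the real `v1beta1.TaskRun` status fields discussed in the review below):

```go
package timeoutsketch

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getEffectiveStartTime returns the earliest start time across the current
// attempt and all recorded retries, so a retried TaskRun is measured from
// when it actually started rather than from the latest reset.
func getEffectiveStartTime(tr *v1beta1.TaskRun) *metav1.Time {
	startTime := tr.Status.StartTime
	for _, retry := range tr.Status.RetriesStatus {
		if retry.StartTime != nil && (startTime == nil || retry.StartTime.Time.Before(startTime.Time)) {
			startTime = retry.StartTime
		}
	}
	return startTime
}

// hasTimedOut reports whether the TaskRun has run longer than the given
// timeout, counting the time spent by previous attempts as well.
func hasTimedOut(tr *v1beta1.TaskRun, timeout time.Duration) bool {
	start := getEffectiveStartTime(tr)
	if start.IsZero() {
		return false // not started yet
	}
	if timeout == 0 {
		return false // in this sketch, a timeout of 0 means "no timeout"
	}
	return time.Since(start.Time) > timeout
}
```

With this lookup, a retried attempt is measured from the first attempt's start, so retries can no longer push execution past the original timeout.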
	return runtime > timeout
}

func (tr *TaskRun) getStartTime() *metav1.Time {
what's a bit confusing for me about this function is that it relies on `RetriesStatus` already having been updated, and now you have to rely on the `getStartTime` method instead of `Status.StartTime`. What could make more sense is updating the `tr.Status.StartTime` at the same time as `RetriesStatus` is updated, and then `tr.Status.StartTime` can be used directly. WDYT?
> what's a bit confusing for me about this function is that it relies on `RetriesStatus` already having been updated, and now you have to rely on the `getStartTime` method instead of `Status.StartTime`

it does not rely on `RetriesStatus` having been updated; the function checks for the actual (oldest) `startTime` of the `TaskRun` regardless of whether there were retries or not:
- if there are no retries, we use the current execution's `Status.StartTime`
- if there are retries, we check for the oldest execution's `Status.StartTime`

> What could make more sense is updating the `tr.Status.StartTime` at the same time as `RetriesStatus` is updated, and then `tr.Status.StartTime` can be used directly. WDYT?

That's actually what we do right now, as shown below, and that's what causes the problem: we rely on the reset `StartTime` to determine whether the `TaskRun` has timed out, without considering the time spent by previous attempts.

`pipeline/pkg/reconciler/pipelinerun/pipelinerun.go`, lines 763 to 779 in 28f9507:
func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
	logger := logging.FromContext(ctx)
	tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
	if tr != nil {
		// Don't modify the lister cache's copy.
		tr = tr.DeepCopy()
		// is a retry
		addRetryHistory(tr)
		clearStatus(tr)
		tr.Status.SetCondition(&apis.Condition{
			Type:   apis.ConditionSucceeded,
			Status: corev1.ConditionUnknown,
		})
		logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
	}
Oh ok I understand now, thanks! Can you comment this method to make that a bit more clear?
I agree with @lbernick, instead of iterating over the `RetriesStatus`, can we rely on `Status.StartTime` such that `Status.StartTime` is not updated with every retry and is kept as the original start time?
@pritidesai that's one of the alternatives considered, but we need to know the execution time of each attempt (I believe that's why the `startTime` was reset on each retry):

> not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry
> What could make more sense is updating the `tr.Status.StartTime` at the same time as `RetriesStatus` is updated

I thought @lbernick was suggesting updating the `startTime` instead of keeping it as the original `startTime` - which is what we already do - please correct me if I misunderstood that
maybe we can go with the second alternative, where we record the actual start time as an implementation detail and use that to compare against the timeout in the `HasTimedOut` function, instead of checking the `RetriesStatus`:

> keeping track of the actual start time in an extra field, but this is information that's already available in the status
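For illustration only, a rough sketch of what that second alternative could have looked like (the `ActualStartTime` field name is hypothetical and does not exist in Tekton's API; the PR does not take this approach):

```go
package statussketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical shape of the "extra field" alternative: the status would carry
// a field that records when the very first attempt started and is never reset
// on retry, so HasTimedOut could compare against it directly instead of
// scanning RetriesStatus.
type RetryAwareStatusSketch struct {
	// StartTime is reset on every retry (today's behaviour).
	StartTime *metav1.Time `json:"startTime,omitempty"`
	// ActualStartTime (assumed name) would be set once, when the first
	// attempt starts, and left untouched by clearStatus on retries.
	ActualStartTime *metav1.Time `json:"actualStartTime,omitempty"`
}
```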
my understanding is, there are multiple `StartTime`s available as part of the taskRun status:
- `tr.Status.StartTime`
- a list of `tr.Status.RetriesStatus[].StartTime`

Is `tr.Status.StartTime` changing with each new retry?
> Is `tr.Status.StartTime` changing with each new retry?

@pritidesai yes, that's correct

when we do a retry, we add the status to retry history in line 771, then we clear the status in line 772:

`pipeline/pkg/reconciler/pipelinerun/pipelinerun.go`, lines 763 to 779 in 28f9507:
func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
	logger := logging.FromContext(ctx)
	tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
	if tr != nil {
		// Don't modify the lister cache's copy.
		tr = tr.DeepCopy()
		// is a retry
		addRetryHistory(tr)
		clearStatus(tr)
		tr.Status.SetCondition(&apis.Condition{
			Type:   apis.ConditionSucceeded,
			Status: corev1.ConditionUnknown,
		})
		logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
	}
when you look into what `clearStatus` does, you see that it sets the `tr.Status.StartTime` to `nil` in line 945:

`pipeline/pkg/reconciler/pipelinerun/pipelinerun.go`, lines 944 to 948 in 28f9507:
func clearStatus(tr *v1beta1.TaskRun) {
	tr.Status.StartTime = nil
	tr.Status.CompletionTime = nil
	tr.Status.PodName = ""
}
because the `tr.Status.StartTime` is `nil`, the `TaskRun` is considered as not having started in the next reconcile loop:

`pipeline/pkg/reconciler/taskrun/taskrun.go`, lines 101 to 124 in 28f9507:
func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1beta1.TaskRun) pkgreconciler.Event {
	logger := logging.FromContext(ctx)
	ctx = cloudevent.ToContext(ctx, c.cloudEventClient)
	// Read the initial condition
	before := tr.Status.GetCondition(apis.ConditionSucceeded)

	// If the TaskRun is just starting, this will also set the starttime,
	// from which the timeout will immediately begin counting down.
	if !tr.HasStarted() {
		tr.Status.InitializeConditions()
		// In case node time was not synchronized, when controller has been scheduled to other nodes.
		if tr.Status.StartTime.Sub(tr.CreationTimestamp.Time) < 0 {
			logger.Warnf("TaskRun %s createTimestamp %s is after the taskRun started %s", tr.GetNamespacedName().String(), tr.CreationTimestamp, tr.Status.StartTime)
			tr.Status.StartTime = &tr.CreationTimestamp
		}
		// Emit events. During the first reconcile the status of the TaskRun may change twice
		// from not Started to Started and then to Running, so we need to sent the event here
		// and at the end of 'Reconcile' again.
		// We also want to send the "Started" event as soon as possible for anyone who may be waiting
		// on the event to perform user facing initialisations, such has reset a CI check status
		afterCondition := tr.Status.GetCondition(apis.ConditionSucceeded)
		events.Emit(ctx, nil, afterCondition, tr)
	}
which calls `InitializeConditions` in L111, which includes setting the `tr.Status.StartTime` to now:

`pipeline/pkg/apis/pipeline/v1beta1/taskrun_types.go`, lines 235 to 251 in 28f9507:
// InitializeConditions will set all conditions in taskRunCondSet to unknown for the TaskRun
// and set the started time to the current time
func (trs *TaskRunStatus) InitializeConditions() {
	started := false
	if trs.StartTime.IsZero() {
		trs.StartTime = &metav1.Time{Time: time.Now()}
		started = true
	}
	conditionManager := taskRunCondSet.Manage(trs)
	conditionManager.InitializeConditions()
	// Ensure the started reason is set for the "Succeeded" condition
	if started {
		initialCondition := conditionManager.GetCondition(apis.ConditionSucceeded)
		initialCondition.Reason = TaskRunReasonStarted.String()
		conditionManager.SetCondition(*initialCondition)
	}
}
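To summarize the walkthrough above with a small illustrative snippet (made-up durations; the struct fields are the real `v1beta1` status fields): after a retry, the failed attempt's status, including its original `StartTime`, is preserved in `RetriesStatus`, while `Status.StartTime` is re-initialized to "now".

```go
package retrysketch

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleRetriedStatus builds the kind of status a retried TaskRun ends up
// with: the first attempt's start time is preserved in RetriesStatus by
// addRetryHistory, while StartTime is cleared by clearStatus and then reset
// to "now" by InitializeConditions on the next reconcile.
func exampleRetriedStatus() v1beta1.TaskRunStatus {
	firstStart := metav1.NewTime(time.Now().Add(-10 * time.Minute)) // first attempt started 10 minutes ago
	retryStart := metav1.NewTime(time.Now())                        // the retry has just (re)started

	var status v1beta1.TaskRunStatus
	status.StartTime = &retryStart // no longer reflects when the TaskRun first started
	status.RetriesStatus = []v1beta1.TaskRunStatus{{
		TaskRunStatusFields: v1beta1.TaskRunStatusFields{
			StartTime: &firstStart, // preserved copy of the failed attempt's start time
		},
	}}
	return status
}
```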
I also feel like it's a bit odd that `getStartTime` will return a different value than if you accessed the start time directly - it feels like we're changing the meaning of "start time" so that it now means 2 different things

on the other hand given the way retries are implemented (reusing the same Taskrun) this does make a sort of sense, and has the advantage of containing the logic for this check in one place...

A couple bits of possible food for thought:
- Runs are going to need this same logic since, iirc, TEP-0069 is going to use the same logic for Runs - which makes me wonder if it makes sense to put this logic somewhere that can be leveraged by both vs. as a function on the TaskRun object?
- What would this logic look like if we DID have multiple TaskRuns created when retrying? We wouldn't be able to overload the start time lookup of one individual taskRun, I think we'd need to look up the previous attempts and find the earliest start time (or keep that information somewhere else as suggested in the alternatives)

The above 2 thoughts have me thinking that TaskRun might not be the best place for this logic but I don't feel strongly about it!
	}}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			result := tc.taskRun.HasTimedOut(context.Background())
			if d := cmp.Diff(result, tc.expectedStatus); d != "" {
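For context, a test case exercising the retry scenario in this style might look roughly like the following (a sketch rather than the PR's actual test table; it assumes the `HasTimedOut(ctx)` signature shown above and that the retry-aware start-time lookup from this PR is in place):

```go
package timeoutsketch

import (
	"context"
	"testing"
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestHasTimedOutConsidersRetries(t *testing.T) {
	firstStart := metav1.NewTime(time.Now().Add(-10 * time.Minute)) // first attempt started 10 minutes ago
	retryStart := metav1.NewTime(time.Now())                        // current attempt's (reset) start time

	tr := &v1beta1.TaskRun{
		Spec: v1beta1.TaskRunSpec{
			Timeout: &metav1.Duration{Duration: 5 * time.Minute},
		},
		Status: v1beta1.TaskRunStatus{
			TaskRunStatusFields: v1beta1.TaskRunStatusFields{
				StartTime: &retryStart,
				RetriesStatus: []v1beta1.TaskRunStatus{{
					TaskRunStatusFields: v1beta1.TaskRunStatusFields{
						StartTime: &firstStart,
					},
				}},
			},
		},
	}

	// With the retry-aware start time, the 5m timeout is measured from the
	// first attempt, so this TaskRun should be reported as timed out.
	if !tr.HasTimedOut(context.Background()) {
		t.Errorf("expected TaskRun to be timed out based on its first attempt's start time")
	}
}
```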
good catch 👍

/lgtm

/test pull-tekton-pipeline-alpha-integration-tests

workspace in sidecar integration test failed, but it's known to be flaky - #4169

could it be flaky 🤔
/test pull-tekton-pipeline-alpha-integration-tests

connection issues? test runner timedout?
/test pull-tekton-pipeline-alpha-integration-tests

alpha integration tests FINALLY passed 💃🏾

@sbwsg thanks for investigating the integration tests failures!
One thing that stands out to me is I don't see any integration (my pet name for "reconciler") or end to end tests for this? (Which might be part of why it wasn't caught in the first place.) https://github.com/tektoncd/pipeline/blob/main/test/timeout_test.go seems like it might be a good candidate to add something?
@@ -528,6 +528,8 @@ The `timeout` value is a `duration` conforming to Go's
values are `1h30m`, `1h`, `1m`, and `60s`. If you set the global timeout to 0, all `PipelineRuns`
that do not have an individual timeout set will fail immediately upon encountering an error.

The above `timeouts` include the time used to execute `retries`.
might be handy to link to docs on retries in a pipeline task here, so folks would understand what "retries" means (esp if this is all totally new to them)
@@ -763,7 +763,7 @@ tasks:
### Configuring the failure timeout

You can use the `Timeout` field in the `Task` spec within the `Pipeline` to set the timeout
of the `TaskRun` that executes that `Task` within the `PipelineRun` that executes your `Pipeline.`
of the `TaskRun` (and its `retries`) that executes that `Task` within the `PipelineRun` that executes your `Pipeline.`
reading this doc and looking at the example below, it's not clear to me if `timeout: "0h1m30s"` would be for each individual TaskRun (and each individual retry) or for the time taken in total across the TaskRun + retries?
- if it's individual, maybe we could say something like "the time allowed for execution of an individual TaskRun, and for each of its subsequent retries"
- if it's the total, maybe something like "the total amount of time allowed for the execution of an individual TaskRun and all of its retries"

(i'm actually not sure this is clearer .. haha this is hard... maybe we can get some input from @geriom XD)
`Tasks` can define their own timeout, so my expectation here was that this was provided to override that at `Pipeline` level. It could be interpreted as a timeout for the sum of all retries of a `Task`, even though that feels a bit strange to me. In either case, I agree we should make it crystal-clear in the docs what this timeout is about :)
@@ -398,6 +398,8 @@ If a `TaskRun` runs longer than its timeout value, the pod associated with the `
means that the logs of the `TaskRun` are not preserved. The deletion of the `TaskRun` pod is necessary in order to
stop `TaskRun` step containers from running.

The above `timeout` is inclusive of the time used to execute the `TaskRun`'s `retries`.
similar question to above: still not sure if this is the total time for taskrun + retries, or each retry will reset the counter
I disagree that this should be the case. A `Task` author has no knowledge about retries, and when they set a timeout it is definitely specific to the single execution of the `Task`.

A `TaskRun` may override that value with a different one specific to the runtime environment (slower/faster execution nodes, for instance).

When the `Task` is executed within a `PipelineRun` with retries, the `PipelineRun` author may want a mechanism to control the timeout for a `TaskRun` and its retries, but I don't think we provide that today and I would see that as a new feature.

We do provide today a way for the `PipelineRun` author to control the maximum amount of time that can be spent in the "non-finally" part of the `Pipeline` (i.e. `timeouts.tasks`), which can be used to ensure that `finally` will run.
If I understand correctly, in case of retries, a […]
Thanks for working on this.

All the possible timeout scenarios are a bit mind-boggling 😅

We definitely should honour the `PipelineRun`-defined `timeouts.tasks` introduced in TEP-0046; `Task` retries should not be allowed to take the execution of the non-finally part of the pipeline beyond that limit.

However, retries are defined at pipeline level, so I think any timeout defined at `Task` or `TaskRun` level should refer to the single execution of the `Task` and not that of a collection of retries.
}
startTime = &metav1.Time{Time: time.Now()}
}
for _, retry := range tr.Status.RetriesStatus {
I don't think the `TaskRun` controller should ever use any data from the `RetriesStatus`. Retries cannot be defined in a `TaskRun`; they are really a `Pipeline` feature.

Since this feature has been implemented in the `PipelineRun` controller as a `TaskRun` status reset + new `Pod`, the `RetriesStatus` was added to the `TaskRun` status to hold a history of past executions for the benefit of the `PipelineRun` controller as well as end users.

I don't know the motivation behind implementing retries like this - I assume there was a good reason for not creating new `TaskRuns` instead. But the fact remains that the `RetriesStatus` is written and managed by the `PipelineRun` controller, and I think it should be completely transparent from the `TaskRun` controller's point of view.
Changes
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This problem is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry; however, it would be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
/kind bug
/closes #4071
Submitter Checklist

As the author of this PR, please check off the items in this checklist:
- Meets the Tekton contributor standards (including functionality, content, code)
Release Notes