Stop executing retried Tasks past their timeouts #4368

Closed
wants to merge 1 commit into from

Conversation

jerop
Member

@jerop jerop commented Nov 9, 2021

Changes

Today, when a Task is retried, it executes past its timeout. Even
worse, the Task executing past its timeout causes the execution of
Finally Tasks to be timed out before executing for as long as their
allocated timeouts. As such, the execution of Finally Tasks is not
guaranteed even when a finally timeout has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of finally at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing timeout."

This problem is caused by the interaction between retries and
timeouts. When a TaskRun is retried:

  • The status of its failed execution is stored in RetriesStatus.
  • The status of the TaskRun is updated, including marking it as
    Running and resetting its StartTime.

Resetting the StartTime is useful for recording the StartTime of
each retry in the RetriesStatus. However, using the reset time as the
start time of the TaskRun when checking whether it has timed out
is incorrect and causes the issues described above.
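
For context, the reset looks roughly like this (a sketch based on the reconciler code quoted later in this PR's review discussion; the body of addRetryHistory is an assumption):

// Sketch only, not necessarily the exact upstream code.
// addRetryHistory is assumed to append a copy of the current status to RetriesStatus;
// clearStatus then wipes the fields that the next attempt will re-initialize.
func addRetryHistory(tr *v1beta1.TaskRun) {
    newStatus := *tr.Status.DeepCopy()
    newStatus.RetriesStatus = nil
    tr.Status.RetriesStatus = append(tr.Status.RetriesStatus, newStatus)
}

func clearStatus(tr *v1beta1.TaskRun) {
    tr.Status.StartTime = nil // the next reconcile re-initializes this to time.Now()
    tr.Status.CompletionTime = nil
    tr.Status.PodName = ""
}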

In this change, we use the actual start time of the TaskRun to check
whether the TaskRun has timed out. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.
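
A minimal sketch of that check, as a getStartTime helper on the v1beta1 TaskRun type (the helper's name appears in the review below; its body here is a reconstruction and may differ from the actual change):

// Sketch only: pick the earliest start time across the current attempt and any
// previous attempts recorded in RetriesStatus, so the timeout check measures total runtime.
func (tr *TaskRun) getStartTime() *metav1.Time {
    startTime := tr.Status.StartTime
    if startTime.IsZero() {
        startTime = &metav1.Time{Time: time.Now()}
    }
    for _, retry := range tr.Status.RetriesStatus {
        if retry.StartTime != nil && retry.StartTime.Time.Before(startTime.Time) {
            startTime = retry.StartTime
        }
    }
    return startTime
}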

Alternative approaches considered include:

  • not resetting the start time of the TaskRun upon retry; however,
    it would then be challenging to know the execution times of each retry.
  • keeping track of the actual start time in an extra field, but this is
    information that's already available in the status.

References:

Co-authored-by: Lee Bernick lbernick@google.com

/kind bug
/closes #4071

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Fix an issue where retried `Tasks` executed past their `timeouts`, thereby guaranteeing that `Finally Tasks` can execute for as long as their allocated `timeouts`.

@jerop jerop added this to the Pipelines v0.30 milestone Nov 9, 2021
@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Nov 9, 2021
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign jerop
You can assign the PR to them by writing /assign @jerop in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 9, 2021
@jerop jerop marked this pull request as ready for review November 9, 2021 23:43
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 9, 2021
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                                        Old Coverage  New Coverage  Delta
pkg/apis/pipeline/v1beta1/taskrun_types.go  77.2%         80.6%         3.4

@ghost

ghost commented Nov 10, 2021

/test pull-tekton-pipeline-integration-tests

/test pull-tekton-pipeline-alpha-integration-tests

Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timed out before executing for as long as their
allocated timeouts. As such, the execution of `Finally Tasks` is not
guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This problem is caused by the interaction between retries and
timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful for recording the `StartTime` of
each retry in the `RetriesStatus`. However, using the reset time as the
start time of the `TaskRun` when checking whether it has timed out
is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timed out. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry; however,
it would then be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <leebernick@google.com>

return runtime > timeout
}

func (tr *TaskRun) getStartTime() *metav1.Time {
Member

what's a bit confusing for me about this function is that it relies on RetriesStatus already having been updated, and now you have to rely on the getStartTime method instead of Status.StartTime. What could make more sense is updating the tr.Status.StartTime at the same time as RetriesStatus is updated, and then tr.Status.StartTime can be used directly. WDYT?

Member Author

@jerop jerop Nov 10, 2021

what's a bit confusing for me about this function is that it relies on RetriesStatus already having been updated, and now you have to rely on the getStartTime method instead of Status.StartTime

it does not rely on RetriesStatus having been updated, the function checks for the actual (oldest) startTime of the TaskRun regardless of whether there were retries or not

  • if there are no retries, we use the current execution's Status.StartTime
  • if there are retries, we check for the oldest execution's Status.StartTime

What could make more sense is updating the tr.Status.StartTime at the same time as RetriesStatus is updated, and then tr.Status.StartTime can be used directly. WDYT?

That's actually what we do right now as shown below, and that's what causes the problem because we rely on the reset StartTime to determine whether the TaskRun has timed out without considering the time spent by previous attempts

func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
    logger := logging.FromContext(ctx)
    tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
    if tr != nil {
        // Don't modify the lister cache's copy.
        tr = tr.DeepCopy()
        // is a retry
        addRetryHistory(tr)
        clearStatus(tr)
        tr.Status.SetCondition(&apis.Condition{
            Type:   apis.ConditionSucceeded,
            Status: corev1.ConditionUnknown,
        })
        logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
        return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
    }

Member

Oh ok I understand now, thanks! Can you comment this method to make that a bit more clear?

Member

I agree with @lbernick, instead of iterating over the RetriesStatus, can we rely on Status.StartTime such that Status.StartTime is not updated with every retry and kept it as original start time?

Member Author

@jerop jerop Nov 10, 2021

@pritidesai that's one of the alternatives considered, but we need to know the execution time of each attempt (I believe that's why the startTime was reset on each retry)

not resetting the start time of the TaskRun upon retry, however it will be challenging to know the execution times of each retry

Member Author

What could make more sense is updating the tr.Status.StartTime at the same time as RetriesStatus is updated

thought @lbernick was suggesting updating the startTime instead of keeping it as the original startTime - which is what we already do - please correct me if I misunderstood that

Member Author

maybe we can go with the second alternative where we record the actual start time as an implementation detail and we use that to compare against the timeout in HasTimedOut function, instead of checking the RetriesStatus

keeping track of the actual start time in an extra field, but this is information that's already available in the status

Member

@pritidesai pritidesai Nov 11, 2021

my understanding is that there are multiple StartTime values available as part of the taskRun status:

  • tr.Status.StartTime
  • a list of tr.Status.RetriesStatus[].StartTime

Is tr.Status.StartTime changing with each new retry?

Member Author

@jerop jerop Nov 11, 2021

Is tr.Status.StartTime changing with each new retry?

@pritidesai yes, that's correct

when we do a retry, we add the status to the retry history in line 771, then we clear the status in line 772:

func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
    logger := logging.FromContext(ctx)
    tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
    if tr != nil {
        // Don't modify the lister cache's copy.
        tr = tr.DeepCopy()
        // is a retry
        addRetryHistory(tr)
        clearStatus(tr)
        tr.Status.SetCondition(&apis.Condition{
            Type:   apis.ConditionSucceeded,
            Status: corev1.ConditionUnknown,
        })
        logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
        return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
    }

when you look into what clearStatus does, you see that it sets tr.Status.StartTime to nil in line 945:

func clearStatus(tr *v1beta1.TaskRun) {
    tr.Status.StartTime = nil
    tr.Status.CompletionTime = nil
    tr.Status.PodName = ""
}

because tr.Status.StartTime is nil, the TaskRun is considered as not having started in the next reconcile loop:

func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1beta1.TaskRun) pkgreconciler.Event {
    logger := logging.FromContext(ctx)
    ctx = cloudevent.ToContext(ctx, c.cloudEventClient)
    // Read the initial condition
    before := tr.Status.GetCondition(apis.ConditionSucceeded)
    // If the TaskRun is just starting, this will also set the starttime,
    // from which the timeout will immediately begin counting down.
    if !tr.HasStarted() {
        tr.Status.InitializeConditions()
        // In case node time was not synchronized, when controller has been scheduled to other nodes.
        if tr.Status.StartTime.Sub(tr.CreationTimestamp.Time) < 0 {
            logger.Warnf("TaskRun %s createTimestamp %s is after the taskRun started %s", tr.GetNamespacedName().String(), tr.CreationTimestamp, tr.Status.StartTime)
            tr.Status.StartTime = &tr.CreationTimestamp
        }
        // Emit events. During the first reconcile the status of the TaskRun may change twice
        // from not Started to Started and then to Running, so we need to sent the event here
        // and at the end of 'Reconcile' again.
        // We also want to send the "Started" event as soon as possible for anyone who may be waiting
        // on the event to perform user facing initialisations, such has reset a CI check status
        afterCondition := tr.Status.GetCondition(apis.ConditionSucceeded)
        events.Emit(ctx, nil, afterCondition, tr)
    }

which calls InitializeConditions in L111, which includes setting tr.Status.StartTime to now:

// InitializeConditions will set all conditions in taskRunCondSet to unknown for the TaskRun
// and set the started time to the current time
func (trs *TaskRunStatus) InitializeConditions() {
    started := false
    if trs.StartTime.IsZero() {
        trs.StartTime = &metav1.Time{Time: time.Now()}
        started = true
    }
    conditionManager := taskRunCondSet.Manage(trs)
    conditionManager.InitializeConditions()
    // Ensure the started reason is set for the "Succeeded" condition
    if started {
        initialCondition := conditionManager.GetCondition(apis.ConditionSucceeded)
        initialCondition.Reason = TaskRunReasonStarted.String()
        conditionManager.SetCondition(*initialCondition)
    }
}

Collaborator

I also feel like it's a bit odd that getStartTime will return a different value than if you accessed the start time directly - it feels like we're changing the meaning of "start time" so that it now means 2 different things

on the other hand given the way retries are implemented (reusing the same Taskrun) this does make a sort of sense, and has the advantage of containing the logic for this check in one place...

A couple bits of possible food for thought:

  • Runs are going to need this same logic since, iirc, TEP-0069 is going to use the same logic for Runs - which makes me wonder if it makes sense to put this logic somewhere that can be leveraged by both vs. as a function on the TaskRun object?
  • What would this logic look like if we DID have multiple TaskRuns created when retrying? We wouldn't be able to overload the start time lookup of one individual taskRun; I think we'd need to look up the previous attempts and find the earliest start time (or keep that information somewhere else as suggested in the alternatives)

The above 2 thoughts have me thinking that TaskRun might not be the best place for this logic but I don't feel strongly about it!

}}

for _, tc := range testCases {
    t.Run(tc.name, func(t *testing.T) {
        result := tc.taskRun.HasTimedOut(context.Background())
        if d := cmp.Diff(result, tc.expectedStatus); d != "" {

good catch 👍
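
For illustration, a retry-aware case in a table-driven HasTimedOut test could look roughly like this (a sketch only; the field layout is assumed from the v1beta1 API and is not the PR's actual test data):

// Hypothetical test case: the current attempt just started, but the first attempt
// (recorded in RetriesStatus) began 15 minutes ago, past the 10 minute timeout.
{
    name: "retried taskrun whose first attempt exceeds the timeout",
    taskRun: &v1beta1.TaskRun{
        Spec: v1beta1.TaskRunSpec{
            Timeout: &metav1.Duration{Duration: 10 * time.Minute},
        },
        Status: v1beta1.TaskRunStatus{
            TaskRunStatusFields: v1beta1.TaskRunStatusFields{
                StartTime: &metav1.Time{Time: time.Now()},
                RetriesStatus: []v1beta1.TaskRunStatus{{
                    TaskRunStatusFields: v1beta1.TaskRunStatusFields{
                        StartTime: &metav1.Time{Time: time.Now().Add(-15 * time.Minute)},
                    },
                }},
            },
        },
    },
    expectedStatus: true, // time spent by previous attempts should count toward the timeout
},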

@ghost

ghost commented Nov 10, 2021

/lgtm

@tekton-robot tekton-robot assigned ghost Nov 10, 2021
@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2021
@jerop
Member Author

jerop commented Nov 10, 2021

/test pull-tekton-pipeline-alpha-integration-tests

workspace in sidecar integration test failed, but it's known to be flaky - #4169

@jerop
Member Author

jerop commented Nov 10, 2021

examples/v1beta1/taskruns/authenticating-git-commands.yaml failed with:

        Cloning into './repo'...
        Permission denied, please try again.
        Permission denied, please try again.
        fatal: Could not read from remote repository.
        
        Please make sure you have the correct access rights
        and the repository exists.

could it be flaky 🤔

/test pull-tekton-pipeline-alpha-integration-tests

@jerop
Member Author

jerop commented Nov 10, 2021

examples_test.go:144: exit status 1 Output: Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.pipeline.tekton.dev": Post "https://tekton-pipelines-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s": context deadline exceeded

connection issues? test runner timed out?

/test pull-tekton-pipeline-alpha-integration-tests

@jerop
Member Author

jerop commented Nov 10, 2021

alpha integration tests FINALLY passed 💃🏾

@ghost

ghost commented Nov 10, 2021

The prow machines might have gotten overloaded here. Looking at metrics for the prow instances we peaked at 100% cpu utilization a couple times over the past hour:

[screenshot: Prow instance CPU utilization graph]

@ghost

ghost commented Nov 10, 2021

The authenticating-git-commands errors were very curious. The logs from the sidecars indicated that ssh server setup succeeded so I can't figure out why the step which pushes to the ssh server failed to do so. I am thinking of rewriting the example YAMLs which rely on sidecars as Integration Tests. This way we might have more precise control on the timeouts, error conditions and logs we can capture when they fail.

@jerop
Member Author

jerop commented Nov 10, 2021

@sbwsg thanks for investigating the integration tests failures!

Collaborator

@bobcatfish bobcatfish left a comment

One thing that stands out to me is I don't see any integration (my pet name for "reconciler") or end to end tests for this? (Which might be part of why it wasn't caught in the first place) https://github.com/tektoncd/pipeline/blob/main/test/timeout_test.go seems like it might be a good candidate to add something?

@@ -528,6 +528,8 @@ The `timeout` value is a `duration` conforming to Go's
values are `1h30m`, `1h`, `1m`, and `60s`. If you set the global timeout to 0, all `PipelineRuns`
that do not have an individual timeout set will fail immediately upon encountering an error.

The above `timeouts` include the time used to execute `retries`.
Collaborator

might be handy to link to docs on retries in a pipeline task here, so folks would understand what "retries" means (esp if this is all totally new to them)

@@ -763,7 +763,7 @@ tasks:
### Configuring the failure timeout

You can use the `Timeout` field in the `Task` spec within the `Pipeline` to set the timeout
-of the `TaskRun` that executes that `Task` within the `PipelineRun` that executes your `Pipeline.`
+of the `TaskRun` (and its `retries`) that executes that `Task` within the `PipelineRun` that executes your `Pipeline.`
Collaborator

reading this doc and looking at the example below, it's not clear to me if `timeout: "0h1m30s"` would be for each individual TaskRun (and each individual retry) or for the time taken in total across the TaskRun + Retries?

  • if individual, maybe we could say something like "the time allowed for execution of an individual TaskRun, and for each of its subsequent retries"
  • if it's the total, maybe something like "the total amount of time allowed for the execution of an individual TaskRun and all of its retries"

(I'm actually not sure this is clearer .. haha this is hard... maybe we can get some input from @geriom XD)

Member

Tasks can define their own timeout, so my expectation here was that this was provided to override that at Pipeline level. It could be interpreted as a timeout for the sum of all retries of a Task, even though that feels a bit strange to me. In either case, I agree we should make it crystal-clear in the docs what this timeout is about :)

@@ -398,6 +398,8 @@ If a `TaskRun` runs longer than its timeout value, the pod associated with the `
means that the logs of the `TaskRun` are not preserved. The deletion of the `TaskRun` pod is necessary in order to
stop `TaskRun` step containers from running.

The above `timeout` is inclusive of the time used to execute the `TaskRun`'s `retries`.
Collaborator

similar question to above: still not sure if this is the total time for taskrun + retries, or each retry will reset the counter

Member

I disagree that this should be the case. A Task author has no knowledge about retries, and when they set a timeout it is definitely specific to a single execution of the Task.
A TaskRun may override that value with a different one specific to the runtime environment (slower/faster execution nodes, for instance).

When the Task is executed within a PipelineRun with retries, the PipelineRun author may want a mechanism to control the timeout for a TaskRun and its retries but I don't think we provide that today and I would see that as a new feature.

We do provide today a way for the PipelineRun author to control the maximum amount of time that can be spent in the "non-finally" part of the Pipeline (i.e. timeouts.tasks) which can be used to ensure that finally will run.

@afrittoli
Member

Today, when a Task is retried, it executes past its timeout. Even
worse, the Task executing past its timeout causes the execution of
Finally Tasks to be timed out before executing for as long as their
allocated timeouts. As such, the execution of Finally Tasks is not
guaranteed even when a finally timeout has been allocated.

If I understand correctly, in case of retries, a Task may execute past
the timeout set at pipeline level for the tasks section of the pipeline.
But it's still correct that each retry is granted its own full timeout, as long as this doesn't hit the pipeline-level timeout.
Is this correct?

Member

@afrittoli afrittoli left a comment

Thanks for working on this.

All the possible timeout scenarios are a bit mind-boggling 😅

We definitely should honour the PipelineRun-defined `timeouts.tasks` introduced in TEP-0046; Task retries should not be allowed to take the execution of the non-finally part of the pipeline beyond that limit.

However, retries are defined at pipeline level, so I think any timeout defined at Task or TaskRun level should refer to the single execution of the Task and not to that of a collection of retries.

}
startTime = &metav1.Time{Time: time.Now()}
}
for _, retry := range tr.Status.RetriesStatus {
Member

I don't think the TaskRun controller should ever use any data from the RetriesStatus.

Retries cannot be defined in a TaskRun; they are really a Pipeline feature.
Since this feature has been implemented in the PipelineRun controller as a TaskRun status reset + new Pod, the RetriesStatus was added to the TaskRun status to hold a history of past executions for the benefit of the PipelineRun controller as well as end users.

I don't know the motivation behind implementing retries like this - I assume there was a good reason for not creating new TaskRuns instead. But the fact remains that the RetriesStatus is written and managed by the PipelineRun controller, and I think it should be completely transparent from a TaskRun controller point of view.

@jerop
Member Author

jerop commented Jan 7, 2022

/hold

@lbernick is working on refactoring the timeouts logic in #4447

@jerop jerop closed this Jan 7, 2022
@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 7, 2022
@jerop jerop deleted the retries-timeout branch June 11, 2022 01:45