Stop executing retried Tasks past their timeouts #4368
Conversation
/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This problem is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry; however, it would be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <leebernick@google.com>
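To make the approach concrete, here is a minimal sketch of the idea (not the exact code in this PR; the helper names `getEffectiveStartTime` and `hasTimedOut` are illustrative, while `Status.StartTime` and `Status.RetriesStatus` are the real `v1beta1.TaskRun` status fields discussed in the review below):

```go
package timeoutsketch

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getEffectiveStartTime returns the earliest start time across the current
// attempt and all recorded retries, so a retried TaskRun is measured from
// when it actually started rather than from the latest reset.
func getEffectiveStartTime(tr *v1beta1.TaskRun) *metav1.Time {
	startTime := tr.Status.StartTime
	for _, retry := range tr.Status.RetriesStatus {
		if retry.StartTime != nil && (startTime == nil || retry.StartTime.Time.Before(startTime.Time)) {
			startTime = retry.StartTime
		}
	}
	return startTime
}

// hasTimedOut reports whether the TaskRun has run longer than the given
// timeout, counting the time spent by previous attempts as well.
func hasTimedOut(tr *v1beta1.TaskRun, timeout time.Duration) bool {
	start := getEffectiveStartTime(tr)
	if start.IsZero() {
		return false // not started yet
	}
	if timeout == 0 {
		return false // in this sketch, a timeout of 0 means "no timeout"
	}
	return time.Since(start.Time) > timeout
}
```

With this lookup, a retried attempt is measured from the first attempt's start, so retries can no longer push execution past the original timeout.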
	return runtime > timeout
}

func (tr *TaskRun) getStartTime() *metav1.Time {
what's a bit confusing for me about this function is that it relies on `RetriesStatus` already having been updated, and now you have to rely on the `getStartTime` method instead of `Status.StartTime`. What could make more sense is updating the `tr.Status.StartTime` at the same time as `RetriesStatus` is updated, and then `tr.Status.StartTime` can be used directly. WDYT?
> what's a bit confusing for me about this function is that it relies on `RetriesStatus` already having been updated, and now you have to rely on the `getStartTime` method instead of `Status.StartTime`

it does not rely on `RetriesStatus` having been updated; the function checks for the actual (oldest) `startTime` of the `TaskRun` regardless of whether there were retries or not:
- if there are no retries, we use the current execution's `Status.StartTime`
- if there are retries, we check for the oldest execution's `Status.StartTime`

> What could make more sense is updating the `tr.Status.StartTime` at the same time as `RetriesStatus` is updated, and then `tr.Status.StartTime` can be used directly. WDYT?

That's actually what we do right now, as shown below, and that's what causes the problem: we rely on the reset `StartTime` to determine whether the `TaskRun` has timed out, without considering the time spent by previous attempts.

`pipeline/pkg/reconciler/pipelinerun/pipelinerun.go`, lines 763 to 779 in 28f9507:
func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
	logger := logging.FromContext(ctx)
	tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
	if tr != nil {
		// Don't modify the lister cache's copy.
		tr = tr.DeepCopy()
		// is a retry
		addRetryHistory(tr)
		clearStatus(tr)
		tr.Status.SetCondition(&apis.Condition{
			Type:   apis.ConditionSucceeded,
			Status: corev1.ConditionUnknown,
		})
		logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
	}
Oh ok I understand now, thanks! Can you comment this method to make that a bit more clear?
I agree with @lbernick, instead of iterating over the `RetriesStatus`, can we rely on `Status.StartTime` such that `Status.StartTime` is not updated with every retry and is kept as the original start time?
@pritidesai that's one of the alternatives considered, but we need to know the execution time of each attempt (I believe that's why the `startTime` was reset on each retry):

> not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry
> What could make more sense is updating the `tr.Status.StartTime` at the same time as `RetriesStatus` is updated

I thought @lbernick was suggesting updating the `startTime` instead of keeping it as the original `startTime` - which is what we already do - please correct me if I misunderstood that
maybe we can go with the second alternative, where we record the actual start time as an implementation detail and use that to compare against the timeout in the `HasTimedOut` function, instead of checking the `RetriesStatus`:

> keeping track of the actual start time in an extra field, but this is information that's already available in the status
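For illustration only, a rough sketch of what that second alternative could have looked like (the `ActualStartTime` field name is hypothetical and does not exist in Tekton's API; the PR does not take this approach):

```go
package statussketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical shape of the "extra field" alternative: the status would carry
// a field that records when the very first attempt started and is never reset
// on retry, so HasTimedOut could compare against it directly instead of
// scanning RetriesStatus.
type RetryAwareStatusSketch struct {
	// StartTime is reset on every retry (today's behaviour).
	StartTime *metav1.Time `json:"startTime,omitempty"`
	// ActualStartTime (assumed name) would be set once, when the first
	// attempt starts, and left untouched by clearStatus on retries.
	ActualStartTime *metav1.Time `json:"actualStartTime,omitempty"`
}
```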
my understanding is, there are multiple `StartTime`s available as part of the taskRun status:
- `tr.Status.StartTime`
- a list of `tr.Status.RetriesStatus[].StartTime`

Is `tr.Status.StartTime` changing with each new retry?
> Is `tr.Status.StartTime` changing with each new retry?

@pritidesai yes, that's correct

when we do a retry, we add the status to retry history in line 771, then we clear the status in line 772:

`pipeline/pkg/reconciler/pipelinerun/pipelinerun.go`, lines 763 to 779 in 28f9507:
func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
	logger := logging.FromContext(ctx)
	tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
	if tr != nil {
		// Don't modify the lister cache's copy.
		tr = tr.DeepCopy()
		// is a retry
		addRetryHistory(tr)
		clearStatus(tr)
		tr.Status.SetCondition(&apis.Condition{
			Type:   apis.ConditionSucceeded,
			Status: corev1.ConditionUnknown,
		})
		logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
	}
when you look into what `clearStatus` does, you see that it sets the `tr.Status.StartTime` to `nil` in line 945:

`pipeline/pkg/reconciler/pipelinerun/pipelinerun.go`, lines 944 to 948 in 28f9507:
func clearStatus(tr *v1beta1.TaskRun) {
	tr.Status.StartTime = nil
	tr.Status.CompletionTime = nil
	tr.Status.PodName = ""
}
because the `tr.Status.StartTime` is `nil`, the `TaskRun` is considered as not having started in the next reconcile loop:

`pipeline/pkg/reconciler/taskrun/taskrun.go`, lines 101 to 124 in 28f9507:
func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1beta1.TaskRun) pkgreconciler.Event {
	logger := logging.FromContext(ctx)
	ctx = cloudevent.ToContext(ctx, c.cloudEventClient)
	// Read the initial condition
	before := tr.Status.GetCondition(apis.ConditionSucceeded)

	// If the TaskRun is just starting, this will also set the starttime,
	// from which the timeout will immediately begin counting down.
	if !tr.HasStarted() {
		tr.Status.InitializeConditions()
		// In case node time was not synchronized, when controller has been scheduled to other nodes.
		if tr.Status.StartTime.Sub(tr.CreationTimestamp.Time) < 0 {
			logger.Warnf("TaskRun %s createTimestamp %s is after the taskRun started %s", tr.GetNamespacedName().String(), tr.CreationTimestamp, tr.Status.StartTime)
			tr.Status.StartTime = &tr.CreationTimestamp
		}
		// Emit events. During the first reconcile the status of the TaskRun may change twice
		// from not Started to Started and then to Running, so we need to sent the event here
		// and at the end of 'Reconcile' again.
		// We also want to send the "Started" event as soon as possible for anyone who may be waiting
		// on the event to perform user facing initialisations, such has reset a CI check status
		afterCondition := tr.Status.GetCondition(apis.ConditionSucceeded)
		events.Emit(ctx, nil, afterCondition, tr)
	}
which calls `InitializeConditions` in L111, which includes setting the `tr.Status.StartTime` to now:

`pipeline/pkg/apis/pipeline/v1beta1/taskrun_types.go`, lines 235 to 251 in 28f9507:
// InitializeConditions will set all conditions in taskRunCondSet to unknown for the TaskRun
// and set the started time to the current time
func (trs *TaskRunStatus) InitializeConditions() {
	started := false
	if trs.StartTime.IsZero() {
		trs.StartTime = &metav1.Time{Time: time.Now()}
		started = true
	}
	conditionManager := taskRunCondSet.Manage(trs)
	conditionManager.InitializeConditions()
	// Ensure the started reason is set for the "Succeeded" condition
	if started {
		initialCondition := conditionManager.GetCondition(apis.ConditionSucceeded)
		initialCondition.Reason = TaskRunReasonStarted.String()
		conditionManager.SetCondition(*initialCondition)
	}
}
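To summarize the walkthrough above with a small illustrative snippet (made-up durations; the struct fields are the real `v1beta1` status fields): after a retry, the failed attempt's status, including its original `StartTime`, is preserved in `RetriesStatus`, while `Status.StartTime` is re-initialized to "now".

```go
package retrysketch

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleRetriedStatus builds the kind of status a retried TaskRun ends up
// with: the first attempt's start time is preserved in RetriesStatus by
// addRetryHistory, while StartTime is cleared by clearStatus and then reset
// to "now" by InitializeConditions on the next reconcile.
func exampleRetriedStatus() v1beta1.TaskRunStatus {
	firstStart := metav1.NewTime(time.Now().Add(-10 * time.Minute)) // first attempt started 10 minutes ago
	retryStart := metav1.NewTime(time.Now())                        // the retry has just (re)started

	var status v1beta1.TaskRunStatus
	status.StartTime = &retryStart // no longer reflects when the TaskRun first started
	status.RetriesStatus = []v1beta1.TaskRunStatus{{
		TaskRunStatusFields: v1beta1.TaskRunStatusFields{
			StartTime: &firstStart, // preserved copy of the failed attempt's start time
		},
	}}
	return status
}
```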
I also feel like it's a bit odd that `getStartTime` will return a different value than if you accessed the start time directly - it feels like we're changing the meaning of "start time" so that it now means 2 different things

on the other hand given the way retries are implemented (reusing the same Taskrun) this does make a sort of sense, and has the advantage of containing the logic for this check in one place...

A couple bits of possible food for thought:
- Runs are going to need this same logic since, iirc, TEP-0069 is going to use the same logic for Runs - which makes me wonder if it makes sense to put this logic somewhere that can be leveraged by both vs. as a function on the TaskRun object?
- What would this logic look like if we DID have multiple TaskRuns created when retrying? We wouldn't be able to overload the start time lookup of one individual taskRun, I think we'd need to look up the previous attempts and find the earliest start time (or keep that information somewhere else as suggested in the alternatives)

The above 2 thoughts have me thinking that TaskRun might not be the best place for this logic but I don't feel strongly about it!
	}}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			result := tc.taskRun.HasTimedOut(context.Background())
			if d := cmp.Diff(result, tc.expectedStatus); d != "" {
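For context, a test case exercising the retry scenario in this style might look roughly like the following (a sketch rather than the PR's actual test table; it assumes the `HasTimedOut(ctx)` signature shown above and that the retry-aware start-time lookup from this PR is in place):

```go
package timeoutsketch

import (
	"context"
	"testing"
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestHasTimedOutConsidersRetries(t *testing.T) {
	firstStart := metav1.NewTime(time.Now().Add(-10 * time.Minute)) // first attempt started 10 minutes ago
	retryStart := metav1.NewTime(time.Now())                        // current attempt's (reset) start time

	tr := &v1beta1.TaskRun{
		Spec: v1beta1.TaskRunSpec{
			Timeout: &metav1.Duration{Duration: 5 * time.Minute},
		},
		Status: v1beta1.TaskRunStatus{
			TaskRunStatusFields: v1beta1.TaskRunStatusFields{
				StartTime: &retryStart,
				RetriesStatus: []v1beta1.TaskRunStatus{{
					TaskRunStatusFields: v1beta1.TaskRunStatusFields{
						StartTime: &firstStart,
					},
				}},
			},
		},
	}

	// With the retry-aware start time, the 5m timeout is measured from the
	// first attempt, so this TaskRun should be reported as timed out.
	if !tr.HasTimedOut(context.Background()) {
		t.Errorf("expected TaskRun to be timed out based on its first attempt's start time")
	}
}
```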
good catch 👍

/lgtm

/test pull-tekton-pipeline-alpha-integration-tests

workspace in sidecar integration test failed, but it's known to be flaky - #4169

could it be flaky 🤔
/test pull-tekton-pipeline-alpha-integration-tests

connection issues? test runner timedout?
/test pull-tekton-pipeline-alpha-integration-tests

alpha integration tests FINALLY passed 💃🏾

@sbwsg thanks for investigating the integration tests failures!
One thing that stands out to me is I don't see any integration (my pet name for "reconciler") or end to end tests for this? (Which might be part of why it wasn't caught in the first place.) https://github.com/tektoncd/pipeline/blob/main/test/timeout_test.go seems like it might be a good candidate to add something?
@@ -528,6 +528,8 @@ The `timeout` value is a `duration` conforming to Go's
values are `1h30m`, `1h`, `1m`, and `60s`. If you set the global timeout to 0, all `PipelineRuns`
that do not have an individual timeout set will fail immediately upon encountering an error.

The above `timeouts` include the time used to execute `retries`.
might be handy to link to docs on retries in a pipeline task here, so folks would understand what "retries" means (esp if this is all totally new to them)
@@ -763,7 +763,7 @@ tasks:
### Configuring the failure timeout

You can use the `Timeout` field in the `Task` spec within the `Pipeline` to set the timeout
of the `TaskRun` that executes that `Task` within the `PipelineRun` that executes your `Pipeline.`
of the `TaskRun` (and its `retries`) that executes that `Task` within the `PipelineRun` that executes your `Pipeline.`
reading this doc and looking at the example below, it's not clear to me if `timeout: "0h1m30s"` would be for each individual TaskRun (and each individual retry) or for the time taken in total across the TaskRun + retries?
- if it's individual, maybe we could say something like "the time allowed for execution of an individual TaskRun, and for each of its subsequent retries"
- if it's the total, maybe something like "the total amount of time allowed for the execution of an individual TaskRun and all of its retries"

(i'm actually not sure this is clearer .. haha this is hard... maybe we can get some input from @geriom XD)
`Tasks` can define their own timeout, so my expectation here was that this was provided to override that at `Pipeline` level. It could be interpreted as a timeout for the sum of all retries of a `Task`, even though that feels a bit strange to me. In either case, I agree we should make it crystal-clear in the docs what this timeout is about :)
@@ -398,6 +398,8 @@ If a `TaskRun` runs longer than its timeout value, the pod associated with the `
means that the logs of the `TaskRun` are not preserved. The deletion of the `TaskRun` pod is necessary in order to
stop `TaskRun` step containers from running.

The above `timeout` is inclusive of the time used to execute the `TaskRun`'s `retries`.
similar question to above: still not sure if this is the total time for taskrun + retries, or each retry will reset the counter
I disagree that this should be the case. A `Task` author has no knowledge about retries, and when they set a timeout it is definitely specific to the single execution of the `Task`.

A `TaskRun` may override that value with a different one specific to the runtime environment (slower/faster execution nodes, for instance).

When the `Task` is executed within a `PipelineRun` with retries, the `PipelineRun` author may want a mechanism to control the timeout for a `TaskRun` and its retries, but I don't think we provide that today and I would see that as a new feature.

We do provide today a way for the `PipelineRun` author to control the maximum amount of time that can be spent in the "non-finally" part of the `Pipeline` (i.e. `timeouts.tasks`), which can be used to ensure that `finally` will run.
If I understand correctly, in case of retries, a […]
Thanks for working on this.

All the possible timeout scenarios are a bit mind-boggling 😅

We definitely should honour the `PipelineRun`-defined `timeouts.tasks` introduced in TEP-0046; `Task` retries should not be allowed to take the execution of the non-finally part of the pipeline beyond that limit.

However, retries are defined at pipeline level, so I think any timeout defined at `Task` or `TaskRun` level should refer to the single execution of the `Task` and not that of a collection of retries.
}
startTime = &metav1.Time{Time: time.Now()}
}
for _, retry := range tr.Status.RetriesStatus {
I don't think the `TaskRun` controller should ever use any data from the `RetriesStatus`. Retries cannot be defined in a `TaskRun`; they are really a `Pipeline` feature.

Since this feature has been implemented in the `PipelineRun` controller as a `TaskRun` status reset + new `Pod`, the `RetriesStatus` was added to the `TaskRun` status to hold a history of past executions for the benefit of the `PipelineRun` controller as well as end users.

I don't know the motivation behind implementing retries like this - I assume there was a good reason for not creating new `TaskRuns` instead. But the fact remains that the `RetriesStatus` is written and managed by the `PipelineRun` controller, and I think it should be completely transparent from the `TaskRun` controller's point of view.
Changes
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This problem is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry; however, it would be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
/kind bug
/closes #4071
Submitter Checklist

As the author of this PR, please check off the items in this checklist:
- Meets the Tekton contributor standards (including functionality, content, code)
Release Notes