TEP-0046: Finally tasks execution post pipelinerun timeout #326

Merged
merged 1 commit into from
Apr 15, 2021

Conversation

souleb
Contributor

@souleb souleb commented Jan 28, 2021

Proposal to enable finally tasks to execute when a pipelinerun has reached timeout.

Add a new flag tasksTimeout which will define a timeout for the dag tasks. The finally tasks timeout will be timeout - tasksTimeout, with timeout >= tasksTimeout and timeout being the current timeout flag.

When tasksTimeout is not defined, timeout is used for the tasks timeout (the behavior is unchanged).
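
A minimal sketch of the arithmetic described above, assuming durations are plain `timedelta` values. The helper name `split_timeout` is hypothetical and not part of Tekton, and treating an undefined `tasksTimeout` as reserving no time for finally is only one reading of "the behavior is unchanged".

```python
from datetime import timedelta
from typing import Optional, Tuple


def split_timeout(
    timeout: timedelta, tasks_timeout: Optional[timedelta] = None
) -> Tuple[timedelta, timedelta]:
    """Return (dag tasks budget, finally tasks budget). Hypothetical helper."""
    if tasks_timeout is None:
        # Assumed reading of "unchanged behavior": the whole timeout applies
        # to the dag tasks and no time is reserved for finally.
        return timeout, timedelta(0)
    if tasks_timeout > timeout:
        # The proposal requires timeout >= tasksTimeout.
        raise ValueError("tasksTimeout must not exceed timeout")
    return tasks_timeout, timeout - tasks_timeout
```

With a one-hour `timeout` and a 45-minute `tasksTimeout`, this sketch leaves 15 minutes for the finally tasks.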

This will enable users to manage run time behavior and make sure their finally tasks run as intended by scoping the tasks runtime period.

/kind tep

cc @jerop @pritidesai

@tekton-robot tekton-robot added the size/L and kind/tep labels Jan 28, 2021
@ghost

ghost commented Jan 29, 2021

Running Finally on pipeline timeout makes sense to me. Only question to my mind is whether we need a variable or status (or something similar) that the Finally task can check to see if a Pipeline timeout occurred. e.g. For reporting purposes. $(context.pipeline.timedOut) or similar? But I think this could be a follow-on TEP if there is user demand.

/lgtm

@tekton-robot tekton-robot assigned ghost Jan 29, 2021
@tekton-robot tekton-robot added the lgtm label Jan 29, 2021
@souleb
Contributor Author

souleb commented Jan 29, 2021

Running Finally on pipeline timeout makes sense to me. Only question to my mind is whether we need a variable or status (or something similar) that the Finally task can check to see if a Pipeline timeout occurred. e.g. For reporting purposes. $(context.pipeline.timedOut) or similar? But I think this could be a follow-on TEP if there is user demand.

/lgtm

Yeah I think that is logical next step.

@pritidesai
Member

$(context.pipeline.timedOut)

we could be a little more specific with $(context.tasks.timedOut).

Member

@afrittoli afrittoli left a comment

Thanks for this, I think it's a very good idea.
There are some details to be clarified in terms of backward compatibility and introducing this feature as alpha first.


Enable finally task to run when a pipeline times out. This implies a behavioral change, as finally tasks will run no matter what.
Member

We will need to put this change of behaviour behind a feature flag, which will by default preserve the current behaviour. I think a possible solution could be as follows:

  • when disabled (default), the new API field is not accepted via validation. The timeout is considered as an overall pipeline timeout. If it happens before finally, finally does not run; if it happens during finally, finally is interrupted and fails.
  • when enabled, the new API field is accepted via validation. The timeout is not applied to the pipeline excluding finally.

Alternatively, we could have two flags, one which controls the API change and one that controls how timeout is handled, to provide backward compatibility. In this scenario:

  • if the API change is enabled (default to disabled), a finally timeout can be specified, but the overall behaviour depends on the other flag
  • if the timeout backward compatibility flag is enabled (default to true), the timeout specified today applies to the overall pipeline. If a finally timeout is specified, the main pipeline will time out after (overall timeout - finally timeout)

This second approach might make sense if we have a single flag to control all alpha API features. One might want to enable alpha API parts but still maintain the current behaviour for pipeline timeouts.
@bobcatfish @vdemeester ^^^

Member

Agreed, because this is a change in behaviour, we need to preserve the current behaviour for a given amount of time.
What we are adding here is:

  • A new field for timeout in finally task (with a default that could be the same default as for TaskRun)
  • A change in behaviour so that the pipelinerun timeout doesn't apply to finally task.

I feel both can be seen as separate: we could have the 2nd without the first (just relying on the default taskrun timeout for finally task), and the 1st could be implemented independently of the behaviour change. Hence I lean towards a flag on the controller to change the behavior, and adding the timeout field under a general "alpha" feature flag, which is the 2nd approach from @afrittoli.

Contributor Author

Not sure I understand here. Relying on the taskrun default timeout is in itself a behavioral change. So, based on @afrittoli's proposal, 2 flags are still needed.

Contributor

@souleb maybe the missing detail is that in TEP-0033 we are discussing adding an "alpha" feature flag that will need to be enabled to use all new alpha features, including specifying a timeout for finally tasks

i agree with @vdemeester and @afrittoli that there are 2 separate features here and we can approach them differently

@pritidesai
Member

@chhsia0 it will be great to have your eyes on this 🙏
/assign @chhsia0

@tekton-robot
Contributor

@pritidesai: GitHub didn't allow me to assign the following users: chhsia0.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

@tekton-robot tekton-robot removed the lgtm label Feb 2, 2021
@souleb
Contributor Author

souleb commented Feb 2, 2021

Thanks for this, I think it's a very good idea.
There are some details to be clarified in terms of backward compatibility and introducing this feature as alpha first.

In terms of backward compatibility, as the new field would be optional, the only thing I can think of is the change in behavior: after the implementation of this TEP, the finally tasks always run. Can you think of anything else?

Base automatically changed from master to main February 3, 2021 16:34
Member

@vdemeester vdemeester left a comment

One comment on how to do this change in a backward compatible way, but other than that, looks good to me.

@pritidesai
Member

/assign pritidesai

Member

@jerop jerop left a comment

thanks for the work on this @souleb 🙏🏾

adding a consideration brought up at the api wg meeting today

the possible solutions so far are modifying the meaning of pipelinerun timeout to be dag tasks timeout and it may be confusing that when pipelinerun has reached its timeout, the finally tasks are executed afterwards. an alternative we could consider is that pipelinerun timeout is inclusive of the finally tasks timeout. so, during execution, we could stop executing dag tasks at some point to give enough time for finally tasks to execute before timing out the pipelinerun (dag tasks timeout = pipelinerun timeout - finally tasks timeout). this could also be as confusing, but it may be worthwhile to consider it and add it to the alternatives section
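
To make the inclusive alternative concrete, here is a small sketch of when dag tasks would have to stop so that finally tasks still fit inside the pipelinerun timeout. The function name `dag_deadline` and the example times are assumptions for illustration only; nothing here reflects an actual Tekton implementation.

```python
from datetime import datetime, timedelta


def dag_deadline(
    start: datetime, pipelinerun_timeout: timedelta, finally_timeout: timedelta
) -> datetime:
    """Time at which dag tasks stop so that finally tasks still fit (hypothetical)."""
    if finally_timeout > pipelinerun_timeout:
        raise ValueError("finally timeout cannot exceed the pipelinerun timeout")
    # dag tasks timeout = pipelinerun timeout - finally tasks timeout
    return start + (pipelinerun_timeout - finally_timeout)


start = datetime(2021, 2, 8, 12, 0, 0)
print(dag_deadline(start, timedelta(hours=1), timedelta(minutes=10)))
# 2021-02-08 12:50:00
```

Under this reading the pipelinerun as a whole still ends at 13:00, with the last 10 minutes reserved for finally tasks.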

@pritidesai
Member

pritidesai commented Feb 9, 2021

Based on the discussion with @sbwsg, @vdemeester, @afrittoli, @jerop, and @skaegi in the API WG on 02/08, I propose two separate features:

  1. Introduce a new API feature flag (boolean) to opt in to always execute finally tasks. This flag will make sure that the finally task is not bound to pipelineRun timeout and is always executed until it finishes with success/failure or until its own timeout is met.

  2. One more flag indicating that the finally tasks get a grace period (by default 0) and are not bound by the pipelineRun timeout. Discussed two options for this:

Option A:

This can be a new flag at the pipelineRun level, finallyTimeout, similar to the timeout flag. If specified, the pipelineRun timeout (default is one hour) applies to dag tasks only. The dag tasks stop executing once they reach the pipelineRun timeout. The finally tasks start executing at this point and run until they reach the timeout specified in finallyTimeout.

Option B:

The pipelineTask timeout in finally tasks takes higher precedence over the pipelineRun timeout, enabled by a new boolean flag either at the pipelineRun level or as an API feature flag, which I think is actually equivalent to 1. But as a feature flag it would apply to all pipeline runs and cannot be applied to just a few pipelines.

Option C:

Deprecate the timeout field in pipelineRun. Introduce two new flags: tasksTimeout (applies only to dag tasks) and runTimeout (applies to the entire run, including dag and finally tasks), which must always be > tasksTimeout. These two new flags can be gated by the API feature flag until they reach beta support. Also, finally tasks are part of the same pipelineRun as dag tasks: when the run times out, the entire pipeline stops executing.

I think the simplest would be option C.

@gpaul This proposal addresses the feature you requested; please provide your feedback, if possible 🙏

@pritidesai
Member

it seems we already have that inconsistency, in that task timeout is defined at authoring time only while pipeline timeout is defined at runtime only

Like we discussed in the API WG today, it does look like there is an inconsistency, but it's implicit, i.e. the timeout at the pipeline spec level is the sum of all the tasks and finally tasks.

@ghost

ghost commented Mar 30, 2021

authoring time: as the author of a task/pipeline, I want to set the timeout for the task/pipeline based on my implementation and understanding of the task/pipeline (e.g. task timeout)

Step timeouts are another example of this at the Task-level too.

we may want to eventually (maybe before v1) support configuring all timeouts at both authoring time and runtime (with runtime overriding authoring time if both are specified, i suppose) to meet tekton reusability principle 3

Yeah I am on board with this idea. Both parties (author and runner) can justify controlling timeouts I think, since there's usage-specific context on both sides.

@souleb
Contributor Author

souleb commented Apr 2, 2021

The last api wg was too late for me 😞 I'd really like to move forward with this TEP. I think that everyone was onboard after the last demo... So what I propose is to go on with the taskstimeouts for this TEP and add a section with a link to a new TEP addressing the aggregated timeouts dict and timeout deprecation.

@ghost

ghost commented Apr 2, 2021

So what I propose is to go on with the taskstimeouts for this TEP and add a section with a link to a new TEP addressing the aggregated timeouts dict and timeout deprecation.

I think that's a good idea - it gives us a chance to weigh the pros and cons of the timeouts revamp in isolation without blocking this feature.

/approve

What do others think?

@tekton-robot tekton-robot added the approved label Apr 2, 2021
@jerop
Member

jerop commented Apr 2, 2021

agreed, let's focus on this small change for now and can revisit the others later 👍🏾

/approve

(needs a non-googler to lgtm)

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, sbwsg

@jerop
Member

jerop commented Apr 2, 2021

@souleb please squash the commits (we prefer one commit per PR) and resolve the tep linting issue

@souleb souleb force-pushed the finallyTimeout branch 3 times, most recently from e954b59 to b5a32de Compare April 2, 2021 20:14
@souleb souleb changed the title TEP-0047: Finally tasks execution post pipelinerun timeout TEP-0046: Finally tasks execution post pipelinerun timeout Apr 2, 2021

```yaml
spec:
  tasksTimeout: "1h0m0s"
```

Member

thanks @souleb, I am confused 😕 What would happen to tasksTimeout when we have follow on work?

Contributor Author

@souleb souleb Apr 5, 2021

It would be deprecated.

spec:
  timeout:
    pipeline: "0h4m0s"
    tasks: "0h1m0s"
    finally: "0h3m0s"

The tasks key in timeout would replace it, with the same behavior. The current timeout would be replaced as well.

Member

I like the current proposal because it's small and it only adds functionality, but I don't like the idea of adding a new field if we plan to deprecate it already. Is there any reason for not using the final syntax right away?

spec:
  timeout: "0h4m0s" # Existing field, current behaviour
  timeouts:
    tasks: "0h1m0s" # New field

@pritidesai
Member

pritidesai commented Apr 12, 2021

Thanks a bunch @souleb for all your hard work. We appreciate your patience 🙏

Here is the list of items I have collected, followed by a rewritten proposal:

ToDo:

  • Set the status to implementable instead of proposed. We have discussed this many times and have agreement from most of the owners going in this direction.
  • Move the follow-on work to the proposal.

Proposal

Introduce a new section timeouts as part of the pipelineRun CRD:

kind: PipelineRun
spec:
  timeouts:
    pipeline: "0h4m0s"
    tasks: "0h1m0s"
    finally: "0h3m0s"
  pipelineSpec:
    tasks:
    - name: tests
      taskRef:
        name: integration-test
    finally:
    - name: cleanup-test
      taskRef:
        name: cleanup

This new section can be used to specify separate timeouts for the tasks and finally sections as well as an overall pipeline-level timeout. If specified, this section must contain at least one sub-section. It can also contain any combination of two sub-sections, or all three at the same time.

Pipeline Timeout

Users can specify the timeout of the entire pipeline. The value specified in the following section overrides the default pipeline timeout, which is configurable via the ConfigMap setting default-timeout-minutes. This specification is equivalent to the traditional pipeline-level timeout specified in the pipelineRun CRD using spec.timeout.

kind: PipelineRun
spec:
  timeouts:
    pipeline: "0h4m0s"

Tasks Timeout

Users can specify the timeout for the tasks section. The value specified here is restricted to the tasks section and also implicitly determines the timeout for the finally section, which is the pipeline timeout (default-timeout-minutes if no pipeline timeout is specified) minus the tasks timeout. For example, with the one-hour default and the specification below, all tasks are terminated after 1 minute, and the finally tasks are then executed and terminated after 59 minutes.

kind: PipelineRun
spec:
  timeouts:
    tasks: "0h1m0s"
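
The derivation can be sketched as follows. This is an illustration only: `parse_duration` is a hypothetical, simplified parser for the `XhYmZs` strings used in this thread (it is not Tekton's parser), and the 60-minute default is assumed from the "default is one hour" remark earlier in the discussion.

```python
import re
from datetime import timedelta

# Hypothetical parser: accepts only whole hours, minutes, and seconds.
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")

# Assumed default pipeline timeout (stand-in for default-timeout-minutes).
DEFAULT_PIPELINE_TIMEOUT = timedelta(minutes=60)


def parse_duration(text: str) -> timedelta:
    match = _DURATION.match(text)
    if not text or match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def derive_finally_timeout(timeouts: dict) -> timedelta:
    """Derive the finally budget when `tasks` (and optionally `pipeline`) is set."""
    if "pipeline" in timeouts:
        pipeline = parse_duration(timeouts["pipeline"])
    else:
        pipeline = DEFAULT_PIPELINE_TIMEOUT
    return pipeline - parse_duration(timeouts["tasks"])


print(derive_finally_timeout({"tasks": "0h1m0s"}))  # 0:59:00
```

Passing `{"pipeline": "0h4m0s", "tasks": "0h1m0s"}` instead would leave 3 minutes for the finally section.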

Finally Timeout

Users can specify the timeout for the finally section. The value specified here is restricted to the finally section and also implicitly determines the timeout for the tasks section, which is the pipeline timeout (default-timeout-minutes if no pipeline timeout is specified) minus the finally timeout.

kind: PipelineRun
spec:
  timeouts:
    finally: "0h3m0s"

Combination of Timeouts

Users can specify the timeout of the entire pipeline and reserve some portion of it for either the tasks section or the finally section.

Combination 1: Set the timeout for the entire pipeline and reserve a portion of it for tasks.

kind: PipelineRun
spec:
  timeouts:
    pipeline: "0h4m0s"
    tasks: "0h1m0s"

Combination 2: Set the timeout for the entire pipeline and reserve a portion of it for finally.

kind: PipelineRun
spec:
  timeouts:
    pipeline: "0h4m0s"
    finally: "0h3m0s"

Some of the validations performed when a pipelineRun is created:

  1. Users can specify either the traditional timeout field spec.timeout or this new section spec.timeouts. Specifying both fields is not allowed.
  2. With this new section, the tasks and finally timeouts must each be less than the pipeline timeout. If both are specified, the sum of tasks and finally must match the pipeline timeout.
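
A rough sketch of these two rules, read literally. The function `validate_timeouts` and its dict-based input are hypothetical and not Tekton's webhook code; the one-hour default is an assumption taken from this thread, and the strict "less than" check follows the wording above.

```python
from datetime import timedelta
from typing import Optional

# Assumed default pipeline timeout (the thread mentions one hour).
DEFAULT_PIPELINE_TIMEOUT = timedelta(hours=1)


def validate_timeouts(
    legacy_timeout: Optional[timedelta], timeouts: Optional[dict]
) -> None:
    """Raise ValueError if the rules above are violated.

    `timeouts` is a hypothetical dict with optional keys "pipeline",
    "tasks", and "finally", each mapped to a timedelta.
    """
    if timeouts is None:
        return
    # Rule 1: spec.timeout and spec.timeouts are mutually exclusive.
    if legacy_timeout is not None:
        raise ValueError("specify either spec.timeout or spec.timeouts, not both")
    if not timeouts:
        raise ValueError("spec.timeouts must contain at least one sub-section")
    pipeline = timeouts.get("pipeline", DEFAULT_PIPELINE_TIMEOUT)
    tasks, final = timeouts.get("tasks"), timeouts.get("finally")
    # Rule 2: each section must be less than the pipeline timeout ...
    for name, value in (("tasks", tasks), ("finally", final)):
        if value is not None and value >= pipeline:
            raise ValueError(f"{name} timeout must be less than the pipeline timeout")
    # ... and when both are set, their sum must match it.
    if tasks is not None and final is not None and tasks + final != pipeline:
        raise ValueError("tasks + finally must match the pipeline timeout")
```

For instance, `pipeline: 4m`, `tasks: 1m`, `finally: 3m` passes in this sketch, while `pipeline: 4m`, `tasks: 1m`, `finally: 2m` is rejected because the sum does not match.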

This new section spec.timeouts will not be available without a feature flag, and the initial implementation will be considered alpha-level support. Once this new section is ready to be promoted to beta or stable, we will be able to start the deprecation process for spec.timeout.

This kind of proposal allows us to smoothly transition away from the old usage to this new section.

@souleb souleb force-pushed the finallyTimeout branch 2 times, most recently from 05d1be4 to 7ab7aa5 Compare April 14, 2021 22:18
@souleb
Contributor Author

souleb commented Apr 14, 2021

@pritidesai all done. Thanks!

@pritidesai
Member

thanks @souleb, looks good to me, please fix the linting failure:

-|[TEP-0046](0046-finallytask-execution-post-timeout.md) | Finally tasks execution post pipelinerun timeout | proposed | 2021-04-14 |
+|[TEP-0046](0046-finallytask-execution-post-timeout.md) | Finally tasks execution post pipelinerun timeout | implementable | 2021-04-14 |

tasksTimeout defines a timeout for the dag tasks. The finally tasks timeout will be timeout - tasksTimeout with timeout >= tasksTimeout.

When tasksTimeout is not defined, the behavior is unchanged.

This will enable users to manage run time behavior and make sure their finally tasks run as intended by scoping the tasks runtime period.
Member

@afrittoli afrittoli left a comment

Thanks for all the updates. @pritidesai I think all your comments have been addressed?
Just a few nits - but nothing blocking.
/lgtm

Comment on lines +31 to +43
<!--
This section is incredibly important for producing high quality user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself.
A good summary is probably at least a paragraph in length.
Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
Member

NIT: could you clean up the comments (can be a follow-up)


Some of the validations being done as part of the creation of `pipelineRun` CRD:
1. Users can specify either the traditional timeout field `spec.timeout` or this new section `spec.timeouts`. Specifying both fields is not allowed.
2. With this new section, the `tasks` and `finally` timeouts must each be less than the pipeline timeout. If both are specified, the sum of `tasks` and `finally` must match the pipeline timeout.
Member

NIT: I'm not fully convinced we need to enforce an exact match.
I understand it might be useful for users to be notified if things don't match up, but I wonder if in future timeouts might be set in different resources, and ensuring they match up might be difficult.
Anyways, something we can discuss on the PR, not blocking here.

Comment on lines +231 to +247
```yaml
kind: PipelineRun
spec:
  timeouts:
    pipeline: "0h4m0s"
    tasks: "0h1m0s"
```

Combination 2: Set the timeout for the entire `pipeline` and reserve a portion of it for `finally`.

```yaml
kind: PipelineRun
spec:
  timeouts:
    pipeline: "0h4m0s"
    finally: "0h3m0s"
```
Member

I assume the following is valid too?

kind: PipelineRun
spec:
  timeouts:
    tasks: "0h1m0s"
    finally: "0h3m0s"


### Finally Timeout flag at Pipelinerun Spec

We could add a new flag at the pipelineRun level, `finallyTimeout`, similar to the timeout flag. If specified, the pipelineRun timeout (default is one hour) applies to dag tasks only. The dag tasks stop executing once they reach the pipelineRun timeout. The finally tasks start executing at this point and run until they reach the timeout specified in finallyTimeout.
Member

NIT: I think the con of this approach is that the pipeline timeout has a different meaning depending on whether finallyTimeout is specified or not, which may be confusing to users.
It has the plus of not requiring us to deprecate the existing field.

@tekton-robot tekton-robot added the lgtm label Apr 15, 2021
@tekton-robot tekton-robot merged commit 481ef7b into tektoncd:main Apr 15, 2021
@abayer
Contributor

abayer commented Dec 13, 2021

Given that the related issue (tektoncd/pipeline#2989) is closed, should this be marked as implemented?

@souleb souleb deleted the finallyTimeout branch December 14, 2021 12:06
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Implemented
8 participants