Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow finally tasks to execute after pipeline timeout #2989

Closed
gpaul opened this issue Jul 22, 2020 · 18 comments
Closed

Allow finally tasks to execute after pipeline timeout #2989

gpaul opened this issue Jul 22, 2020 · 18 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@gpaul
Copy link
Contributor

gpaul commented Jul 22, 2020

Feature request

Tekton recently introduced "Finally Tasks" that execute once a Pipeline succeeds or fails. The design document lists the following use cases:

  • Cleanup cluster resources after finishing (with success/failure) integration tests (Dogfooding Scenario)
  • Update Pull Request with what happened overall in the pipeline (pipeline level)
  • Report Test Results at the end of the test pipeline (Notifications Scenario)

These are all legitimate use cases that show the value of the "Finally Tasks at the Pipeline Level" feature.

However, these tasks will not execute in the event that a Pipeline's execution duration exceeds its Timeout value. Pipelines can have unpredictable or highly variable execution times for various reasons: network transfers, access to third-party services, or delayed execution due to resource contention can all lead to a Pipeline unexpectedly exceeding its configured Timeout.

For this reason, it is currently not possible to rely on Finally tasks for any of the aforementioned use cases. They are an elegant optimisation, but do not provide a contract that can be relied upon in general.

This feature request is to extend the existing Finally Task functionality to allow Finally Tasks to execute even if the enclosing PipelineRun times out. This rounds out the Finally Task feature by addressing the common "PipelineRun timeout" edge case.

Use case

Every reason that one wants Finally Tasks, with the added confidence that they will execute regardless of PipelineRun timeout.

@gpaul gpaul added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 22, 2020
@dlorenc
Copy link
Contributor

dlorenc commented Jul 23, 2020

This is an interesting one - I agree there should be a way to guarantee "finally" runs, but should there be a timeout for that as well?

@vdemeester
Copy link
Member

/cc @bobcatfish @pritidesai

@pritidesai
Copy link
Member

pritidesai commented Jul 28, 2020

This is interesting indeed and I agree, we should be able to guarantee that finally tasks run even the Pipeline (that they part of) timeout. This can be configurable and disabled by default.

Today, pipelineRun timeout takes the highest precedence and exits with failure if its tasks not able to finish in time.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipelinerun-with-timeout
spec:
  timeout: "0h0m3s"
  pipelineSpec:
    tasks:
      - name: echo-good-morning
        timeout: "0h0m30s"
        taskSpec:
          steps:
            - name: echo
              image: ubuntu
              script: |
                #!/usr/bin/env bash
                echo "Good Morning!"
                sleep 60
---

PipelineRun failure:

 tkn pr describe pipelinerun-with-timeout
Name:        pipelinerun-with-timeout
Namespace:   default
Timeout:     3s
Labels:
 tekton.dev/pipeline=pipelinerun-with-timeout

🌡️  Status

STARTED        DURATION    STATUS
1 minute ago   3 seconds   Failed(PipelineRunTimeout)

💌 Message

PipelineRun "pipelinerun-with-timeout" failed to finish within "3s" (TaskRun "pipelinerun-with-timeout-echo-good-morning-4ksv8" failed to finish within "30s")

📦 Resources

 No resources

⚓ Params

 No params

🗂  Taskruns

 NAME                                                 TASK NAME           STARTED        DURATION     STATUS
 ∙ pipelinerun-with-timeout-echo-good-morning-4ksv8   echo-good-morning   1 minute ago   30 seconds   Failed(TaskRunTimeout)

We can have finally task timeout take higher precedence (if specified) and overwrite pipelineRun timeout (default is 1 hour). finally task without any timeout would honor pipelineRun timeout (just like today).

But this option will break the backward compatibility of PipelineRun taking highest precedence and overwriting finally task timeout. If this is something we have to preserve, we can add a separate field finallyTimeout or something similar 🤔

@gpaul
Copy link
Contributor Author

gpaul commented Jul 28, 2020

That seems very elegant.

@vdemeester
Copy link
Member

We can have finally task timeout take higher precedence (if specified) and overwrite pipelineRun timeout (default is 1 hour). finally task without any timeout would honor pipelineRun timeout (just like today).

But this option will break the backward compatibility of PipelineRun taking highest precedence and overwriting finally task timeout. If this is something we have to preserve, we can add a separate field finallyTimeout or something similar thinking

I really like that idea 👼 I wonder how much it breaks the backward compatibility. How much users do use timeout in finally, … But I would love to have this 🙃

@pritidesai
Copy link
Member

finally task timeout take higher precedence (if specified) and overwrite PipelineRun timeout

+1

@bobcatfish @afrittoli thoughts?

@dlorenc
Copy link
Contributor

dlorenc commented Jul 29, 2020

Just to makes sure I understand this - is this the idea?

  • A pipeline has a timeout set to 1 hour
  • It has a finally Task`, that has a finallyTimeout of 10 minutes.
  • The pipeline runs for an hour, and is terminated.
  • The finally task then gets executed, and can run for 10 minutes?

@pritidesai
Copy link
Member

Yes.

Pipeline schedules finally after all non-finally are done within that hour otherwise non-finally tasks are terminated after an hour and finally runs for 10 minutes.

The idea is finallyTimeout does not modify timeout for non-finally tasks but confirm the execution of finally tasks for the specified amount of time.

@dlorenc
Copy link
Contributor

dlorenc commented Aug 1, 2020

+1 to that!

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020
@pritidesai
Copy link
Member

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020
@jerop
Copy link
Member

jerop commented Jan 26, 2021

/assign @souleb

thank you!!

@tekton-robot
Copy link
Collaborator

@jerop: GitHub didn't allow me to assign the following users: souleb.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @souleb

thank you!!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@souleb
Copy link
Contributor

souleb commented Jan 26, 2021

/assign

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@jerop
Copy link
Member

jerop commented Apr 26, 2021

/remove-lifecycle stale

@souleb is actively working on this:
TEP: https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md
PR: #3843

thank you @souleb 😁

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2021
@jerop
Copy link
Member

jerop commented Oct 15, 2021

@jerop jerop closed this as completed Oct 15, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 9, 2021
Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timedout before executing for their allocated
timeout. As such, the execution of `Finally Tasks` is not guaranteed
even when a `finally timeout` has been allocated.

As described by @ornew in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This issue is caused by the the interaction between retries and timeouts.
When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful so we can record the `StartTime` of
each retry in the `RetriesStatus`. However, using it as the reset start
time of the `TaskRun` to check whether it has timedout is incorrect and
causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timedout.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, but then
it'll be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
jerop added a commit to jerop/pipeline that referenced this issue Nov 9, 2021
Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timedout before executing for as long as their
allocated timeouts. As such, the execution of `Finally Tasks` is not
guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This prblem is caused by the the interaction between retries and
timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of
each retry in the `RetriesStatus`. However, using the reset time as the
reset start time of the `TaskRun` when checking whether it has timedout
is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timedout. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, however
it will be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timedout before executing for as long as their
allocated timeouts. As such, the execution of `Finally Tasks` is not
guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This problem is caused by the the interaction between retries and
timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of
each retry in the `RetriesStatus`. However, using the reset time as the
reset start time of the `TaskRun` when checking whether it has timedout
is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timedout. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, however
it will be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timedout before executing for as long as their
allocated timeouts. As such, the execution of `Finally Tasks` is not
guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This problem is caused by the the interaction between retries and
timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of
each retry in the `RetriesStatus`. However, using the reset time as the
reset start time of the `TaskRun` when checking whether it has timedout
is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timedout. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, however
it will be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timedout before executing for as long as their
allocated timeouts. As such, the execution of `Finally Tasks` is not
guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This problem is caused by the the interaction between retries and
timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of
each retry in the `RetriesStatus`. However, using the reset time as the
reset start time of the `TaskRun` when checking whether it has timedout
is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timedout. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, however
it will be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
Today, when a `Task` is retried, it executes past its timeout. Even
worse, the `Task` executing past its timeout causes the execution of
`Finally Tasks` to be timedout before executing for as long as their
allocated timeouts. As such, the execution of `Finally Tasks` is not
guaranteed even when a `finally timeout` has been allocated.

@ornew described the problem in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution
of `finally` at all. This is a really serious problem for operating in
production, as Tekton we operate has caused a lot of resource leaks and
a lot of damage by not running finally. It's the biggest problem with
existing `timeout`."

This problem is caused by the the interaction between retries and
timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as
`Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of
each retry in the `RetriesStatus`. However, using the reset time as the
reset start time of the `TaskRun` when checking whether it has timedout
is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check
whether the `TaskRun` has timedout. We do this by checking the start
time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, however
it will be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is
information that's already available in the status

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <leebernick@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

7 participants