Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timeouts is not working as expected. #6137

Closed
VeereshAradhya opened this issue Feb 9, 2023 · 16 comments · Fixed by #6171
Closed

timeouts is not working as expected. #6137

VeereshAradhya opened this issue Feb 9, 2023 · 16 comments · Fixed by #6171
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@VeereshAradhya
Copy link

VeereshAradhya commented Feb 9, 2023

Expected Behavior

When provided timeouts for pipeline and task in timeout section of pipelinerun, both pipelinerun and taskrun timeout should get updated

Actual Behavior

When provided timeouts for pipeline and task in timeouts section of pipelinerun, only the pipelinerun timeout is updated and the taskrun gets created with default timeout

Steps to Reproduce the Problem

  1. Create a task and pipeline
  2. Create pipelierun with timeouts section defined

Additional Info

  • Kubernetes version:

    Output of kubectl version:

 Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"archive", BuildDate:"2022-11-11T00:00:00Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
 Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6+263df15", GitCommit:"eddac29feb4bb46b99fb570999324e582d761a66", GitTreeState:"clean", BuildDate:"2023-01-23T21:00:20Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

Client version: 0.28.0
Pipeline version: v0.41.0
Triggers version: v0.22.0
Operator version: v0.63.0
[varadhya@fedora plumbing]$ oc get tektonconfig config -o yaml | grep timeout 
    default-timeout-minutes: 5
[varadhya@fedora plumbing]$ cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: test-timeout-run-
spec:
  timeouts:
    pipeline: 2h
    tasks: 1h
  pipelineRef:
    name: test-timeout
[varadhya@fedora plumbing]$ oc create -f pr.yaml
pipelinerun.tekton.dev/test-timeout-run-n64qf created
[varadhya@fedora plumbing]$ tkn pr describe -L
Name:              test-timeout-run-n64qf
Namespace:         test-timeout
Pipeline Ref:      test-timeout
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=test-timeout

🌡️  Status

STARTED         DURATION   STATUS
6 seconds ago   ---        Running

⏱  Timeouts
 Pipeline:   2h0m0s
 Tasks:      1h0m0s

🗂  Taskruns

 NAME                                    TASK NAME      STARTED         DURATION   STATUS
 ∙ test-timeout-run-n64qf-example-task   example-task   6 seconds ago   ---        Running
[varadhya@fedora plumbing]$ tkn tr describe -L
Name:              test-timeout-run-n64qf-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           5m0s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=test-timeout
 tekton.dev/pipelineRun=test-timeout-run-n64qf
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=e38d112

🌡️  Status

STARTED          DURATION    STATUS
25 seconds ago   ---         Running

⚓ Params

 NAME        VALUE
 ∙ appName   123

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running
@VeereshAradhya VeereshAradhya added the kind/bug Categorizes issue or PR as related to a bug. label Feb 9, 2023
@VeereshAradhya VeereshAradhya changed the title timeouts field not working as expected. timeouts field is not working as expected. Feb 9, 2023
@VeereshAradhya VeereshAradhya changed the title timeouts field is not working as expected. timeouts is not working as expected. Feb 9, 2023
@lbernick
Copy link
Member

lbernick commented Feb 9, 2023

thanks for filing @VeereshAradhya! The "tasks" timeout for a PipelineRun is for the cumulative time taken by all of its tasks. If you'd like to set a timeout for an individual task, you can use pipeline.spec.tasks[].timeout. I'm going to close out this issue but please feel free to reopen if you have follow-up questions.

@lbernick lbernick closed this as completed Feb 9, 2023
@VeereshAradhya
Copy link
Author

The pipeline that I used had only one task. In the below example, I have specified task timeout as 2hours (cumulative) so the taskrun should have timeout of 2hours right?

cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: timeout-pipeline-run-
spec:
  timeouts:
    pipeline: 3h
    tasks: 2h
  pipelineRef:
    name: timeout-pipeline
status: {}
[varadhya@192 plumbing]$ tkn pr describe -L
Name:              timeout-pipeline-run-hml2q
Namespace:         test-timeout
Pipeline Ref:      timeout-pipeline
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=timeout-pipeline

🌡️  Status

STARTED          DURATION   STATUS
20 seconds ago   ---        Running

⏱  Timeouts
 Pipeline:   3h0m0s
 Tasks:      2h0m0s

🗂  Taskruns

 NAME                                        TASK NAME      STARTED          DURATION   STATUS
 ∙ timeout-pipeline-run-hml2q-example-task   example-task   20 seconds ago   ---        Running
[varadhya@192 plumbing]$ tkn tr describe -L
Name:              timeout-pipeline-run-hml2q-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           1h0m0s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=timeout-pipeline
 tekton.dev/pipelineRun=timeout-pipeline-run-hml2q
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=e38d112

🌡️  Status

STARTED          DURATION    STATUS
26 seconds ago   ---         Running

⚓ Params

 NAME        VALUE
 ∙ appName   123

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running

@VeereshAradhya
Copy link
Author

VeereshAradhya commented Feb 9, 2023

I tested the same thing on 0.37.5 and the behaviour is different. The taskrun timeout is 2h (1h59m59.990998163s)

$ cat pr.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: test-timeout-run-
spec:
  timeouts:
    pipeline: 3h
    tasks: 2h
  pipelineRef:
    name: test-timeout
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ tkn pr describe -L
Name:              test-timeout-run-2x9bd
Namespace:         test-timeout
Pipeline Ref:      test-timeout
Service Account:   pipeline
Labels:
 tekton.dev/pipeline=test-timeout

🌡️  Status

STARTED          DURATION   STATUS
21 seconds ago   ---        Running

⏱  Timeouts
 Pipeline:   3h0m0s
 Tasks:      2h0m0s

🗂  Taskruns

 NAME                                    TASK NAME      STARTED          DURATION   STATUS
 ∙ test-timeout-run-2x9bd-example-task   example-task   21 seconds ago   ---        Running
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ tkn tr describe -L
Name:              test-timeout-run-2x9bd-example-task
Namespace:         test-timeout
Task Ref:          example-task
Service Account:   pipeline
Timeout:           1h59m59.990998163s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/memberOf=tasks
 tekton.dev/pipeline=test-timeout
 tekton.dev/pipelineRun=test-timeout-run-2x9bd
 tekton.dev/pipelineTask=example-task
 tekton.dev/task=example-task
Annotations:
 pipeline.tekton.dev/release=1759eab

🌡️  Status

STARTED          DURATION    STATUS
29 seconds ago   ---         Running

🦶 Steps

 NAME          STATUS
 ∙ unnamed-0   Running
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ 
[varadhya@192 plumbing]$ tkn version
Client version: 0.28.0
Pipeline version: v0.37.5
Triggers version: v0.20.2
Operator version: v0.60.1

@VeereshAradhya
Copy link
Author

/reopen

@tekton-robot tekton-robot reopened this Feb 9, 2023
@tekton-robot
Copy link
Collaborator

@VeereshAradhya: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lbernick
Copy link
Member

lbernick commented Feb 9, 2023

Yes, we changed the behavior of timeouts between the versions you've listed here. We previously applied the time remaining from timeouts.tasks as the timeout to each taskrun. Now, the PipelineRun controller will cancel any running tasks after timeouts.tasks has elapsed. (This helped fix a race condition plus issues with retried taskruns.) The tasks timeout isn't applied to the individual taskrun timeout (even if there's only one) but the taskRuns will still be canceled after the tasks timeout has elapsed.

@vdemeester
Copy link
Member

Oh this is actually confusing now.
When I naively look at timeouts.pipeline and timeouts.tasks, I really thing the one on tasks is for "each" tasks. I asked around a bit (friends, customers, …) and they all thought the same. In addition, what is the difference between timeouts.tasks and timeouts.pipelines now ? If timeout.tasks is "for the cumulative time taken by all of its tasks" and, a pipeline is made out of Tasks, what is the difference with timeouts.pipeline (which is for the whole pipeline, which is the "cumulation" of Tasks) ?

Yes, we changed the behavior of timeouts between the versions you've listed here.

Was there any deprecation or feature-flags here (like when we changed the when behavior) ? It is breaking some of our workloads (and possible our customers as well) because of that behavior change.

@AlanGreene
Copy link
Member

timeouts.tasks is for non-finally tasks, timeouts.pipeline includes both regular tasks and finally tasks
https://tekton.dev/docs/pipelines/pipelineruns/#configuring-a-failure-timeout

The documentation could probably make this clearer that tasks is the cumulative timeout for all non-finally tasks, not each individual task.

@lbernick
Copy link
Member

@vdemeester can you give more detail on what workloads were breaking due to this change? It was introduced in #5134

@vdemeester
Copy link
Member

@lbernick On our downstream pipelines, we used to have a given amount for tasks to timeout (using timeouts.tasks) that was different from the default TaskRun timeout. With the upgrade, now the timeout is the default (which is lower than the one we used to have with timeout.tasks), and thus we have our pipeline failed because of this. It is definitely fixable in the Pipeline/PipelineRun definitions, it's just that it did break some existing behaviour on upgrade.

@lbernick
Copy link
Member

Not sure I understand-- I would expect the pipelinerun controller to time out the tasks after timeouts.tasks elapses, even if that's shorter than the default taskrun timeout. Is there an example PipelineRun that passed prior to these changes that fails now?

@vdemeester
Copy link
Member

Not sure I understand-- I would expect the pipelinerun controller to time out the tasks after timeouts.tasks elapses, even if that's shorter than the default taskrun timeout. Is there an example PipelineRun that passed prior to these changes that fails now?

This is what @VeereshAradhya wrote above to be fair. With timeouts.tasks = 2h, in 0.37.x it would set the TaskRun timeout to 2h, now it sets it to 1h (because it uses the default timeout for all tasks). This breaks because we relied on that timeouts.tasks to set that timeout for 2h for each TaskRun (well here it's only one but still). Now it changed 🙃

@lbernick
Copy link
Member

Ah I see, sorry I didn't quite understand this before! I doubt we'd want to revert these changes; is there anything you'd suggest doing to mitigate the impact? We could definitely edit the release notes to make the impact more clear.

@vdemeester
Copy link
Member

Ah I see, sorry I didn't quite understand this before! I doubt we'd want to revert these changes; is there anything you'd suggest doing to mitigate the impact?

Oh yeah, no really asking to revert the change (yet ? 😈). It's more "for next time we need to be more careful". The "naming" is still a bit confusing to me (timeouts.alltasks or something would have made more sense to me) but, I also don't think that's a huge deal.

We could definitely edit the release notes to make the impact more clear.

👍🏼

@lbernick
Copy link
Member

@VeereshAradhya PTAL at #6171 and let me know if these updated docs are clearer!

@lbernick
Copy link
Member

Also, release notes have been updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants