Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate cancel and timeout logic #2365

Merged
merged 1 commit into from
Apr 16, 2020

Conversation

afrittoli
Copy link
Member

Cancel and timeout do very similar things when they happen: they
update the status of the taskrun, set the completion time and try
and delete the pod.

Today this is done for the two cases in different places, the code
structured differently and the behaviour slightly different:

  • log levels of the messages are different
  • cancel does not set the completion time
  • cancel does not check if the error on pod deletion is a NotFound

This commit introduces "HasTimedOut" to tasktun_types, which
matches what "IsCancelled" does. It introduces a "killTaskRun"
function that can be used by both cancel and timeout, with the
only different being the "Reason" and termination message.
The timeout_check module is not necessary anymore.

The check for IsCancelled and HasTimedOut are move out of
"reconcile" into "Reconcile", so that now "Reconcile" checks:

  • HasStarted
  • isDone
  • IsCancelled
  • HasTimedOut
    and finally, if applicable, it invokes "reconcile".

Changes

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Describe any user facing changes here, or delete this block.

Examples of user facing changes:
- API changes
- Bug fixes
- Any changes in behavior

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Apr 10, 2020
@tekton-robot tekton-robot requested review from abayer and dibyom April 10, 2020 11:24
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 10, 2020
@afrittoli afrittoli changed the title Consolidate cancel and timeout logic WIP Consolidate cancel and timeout logic Apr 10, 2020
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 61.3% -29.2

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch 2 times, most recently from 882452e to 9e224cc Compare April 15, 2020 07:35
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 86.1% -4.4

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from 9e224cc to 09eed05 Compare April 15, 2020 08:10
@afrittoli afrittoli changed the title WIP Consolidate cancel and timeout logic Consolidate cancel and timeout logic Apr 15, 2020
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 86.1% -4.4

@@ -168,74 +191,15 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
return multierror.Append(merr, c.updateStatusLabelsAndAnnotations(tr, original)).ErrorOrNil()
}

func (c *Reconciler) updateStatusLabelsAndAnnotations(tr, original *v1alpha1.TaskRun) error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just moved down, no changes

@@ -465,67 +498,38 @@ func (c *Reconciler) handlePodCreationError(tr *v1alpha1.TaskRun, err error) {
c.Logger.Errorf("Failed to create build pod for task %q: %v", tr.Name, err)
}

func updateTaskRunResourceResult(taskRun *v1alpha1.TaskRun, podStatus corev1.PodStatus) error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just moved down - no changes.
I consolidated all func (c *Reconciler) together.

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from 09eed05 to 44cd91f Compare April 15, 2020 09:16
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 86.1% -4.4

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from 44cd91f to 2eb7777 Compare April 15, 2020 12:21
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 86.1% -4.4
pkg/apis/pipeline/v1beta1/taskrun_types.go 71.4% 67.6% -3.8

if tr.IsCancelled() {
before := tr.Status.GetCondition(apis.ConditionSucceeded)
message := fmt.Sprintf("TaskRun %q was cancelled", tr.Name)
err := c.failTaskRun(tr, "TaskRunCancelled", message)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: suggest using the constant TaskRunSpecStatusCancelled here instead of literal string "TaskRunCancelled". Or putting it into a named constant if you want to keep task.spec.status conceptually separate from task.status.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I'll fix that, thank you

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2020
@afrittoli
Copy link
Member Author

afrittoli commented Apr 15, 2020

I'm hitting two data races in the unit tests - not tests that I've touched - but perhaps the new logic exposes or creates a new issue.

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from 2eb7777 to 4ea9f58 Compare April 15, 2020 19:57
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 86.1% -4.4
pkg/apis/pipeline/v1beta1/taskrun_types.go 71.4% 72.7% 1.3

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from 4ea9f58 to cb42966 Compare April 15, 2020 20:43
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 86.1% -4.4
pkg/apis/pipeline/v1beta1/taskrun_types.go 71.4% 72.7% 1.3

@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from cb42966 to 86b13b6 Compare April 16, 2020 08:25
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 84.6% 86.1% 1.5
pkg/apis/pipeline/v1beta1/taskrun_types.go 69.7% 72.7% 3.0

@@ -120,6 +121,17 @@ func (trs *TaskRunStatus) MarkResourceNotConvertible(err *CannotConvertError) {
})
}

// MarkResourceFailed sets the ConditionSucceeded condition to ConditionFalse
// based on an error that occurred and a reason
func (trs *TaskRunStatus) MarkResourceFailed(reason string, err error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should be in alpha too...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TaskRunStatus defines the observed state of TaskRun
type TaskRunStatus = v1beta1.TaskRunStatus

It doesn't as it's just an alias in v1alpha1 😉

Cancel and timeout do very similar things when they happen: they
update the status of the taskrun, set the completion time and try
and delete the pod.

Today this is done for the two cases in different places, the code
structured differently and the behaviour slightly different:
- log levels of the messages are different
- cancel does not set the completion time
- cancel does not check if the error on pod deletion is a NotFound

This commit introduces "HasTimedOut" to tasktun_types, which
matches what "IsCancelled" does. It introduces a "killTaskRun"
function that can be used by both cancel and timeout, with the
only different being the "Reason" and termination message.
The timeout_check module is not necessary anymore.

The check for IsCancelled and HasTimedOut are move out of
"reconcile" into "Reconcile", so that now "Reconcile" checks:
- HasStarted
- isDone
- IsCancelled
- HasTimedOut
and finally, if applicable, it invokes "reconcile".
@afrittoli afrittoli force-pushed the refactor_taskrun_controller branch from 86b13b6 to 279957f Compare April 16, 2020 09:16
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 84.6% 86.1% 1.5
pkg/apis/pipeline/v1beta1/taskrun_types.go 69.7% 72.7% 3.0

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 16, 2020
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@afrittoli
Copy link
Member Author

The failure in the integration tests is on pipeline-task retry.
It doesn't seem to be related to this patch and I've not been able to reproduce the issue locally, so retesting. I wonder where the extra pod comes from though, there may be a race condition somewhere.

@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@tekton-robot tekton-robot merged commit 6b1579c into tektoncd:master Apr 16, 2020
@afrittoli afrittoli added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Apr 30, 2020
return runtime > timeout
}

func (tr *TaskRun) GetTimeout() time.Duration {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @afrittoli are you sure we need this function? looking at the taskrun defaulting logic it looks like we default the value of the timeout:

https://github.com/tektoncd/pipeline/blob/master/pkg/apis/pipeline/v1beta1/taskrun_defaults.go#L47

also we don't have the same function for pipelineruns, and finally finally apisconfig.DefaultTimeoutMinutes is a default value afaik; but the user can actually override it with a config map so we probably want to use that instead (there is some similar logic in the timeout logic that i think is being overly cautious - its always hard to know when you can rely on the defaulting to be called)

// tr.Status.PodName will be empty if the pod was never successfully created. This condition
// can be reached, for example, by the pod never being schedulable due to limits imposed by
// a namespace's ResourceQuota.
err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Delete(tr.Status.PodName, &metav1.DeleteOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if some kind of if statement should have been placed here to guard against scenarios when a TaskRun reason is v1beta1.TaskRunReasonTimedOut or v1beta1.TaskRunCancelled. As of now, this is deleting pods for TaskRuns that have timed out or been cancelled, which amongst other things, deletes the TaskRun logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants