
TaskRun retries create extra Pods #1976

Closed
imjasonh opened this issue Jan 29, 2020 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@imjasonh
Member

imjasonh commented Jan 29, 2020

Expected Behavior

Retrying a TaskRun N times within a PipelineRun should create at most N Pods to execute each attempt.

Actual Behavior

The TaskRunStatus reports the statuses of N retries, but in reality more than N Pods are created.

The e2e test added in #1975 reports this:

    retry_test.go:111: Found 7 Pods, want 5
    retry_test.go:115: BUG: TaskRunStatus.RetriesStatus did not report pod name "retry-pipeline-retry-me-2cpn8-pod-cjfnn"
    retry_test.go:118: BUG: Pod "retry-pipeline-retry-me-2cpn8-pod-cjfnn" is not failed: Running
    retry_test.go:115: BUG: TaskRunStatus.RetriesStatus did not report pod name "retry-pipeline-retry-me-2cpn8-pod-gpjf4"

The TaskRun was configured to retry 5 times, but created 7 pods (sometimes it's only 6), one of which was still Running at the time the test listed Pods.

Steps to Reproduce the Problem

Run go test -tags=e2e ./test -run=Retry and observe logs like those above. Change t.Log to t.Error to make the test fail and see a full K8s object dump.

@vdemeester
Member

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 29, 2020
@vincent-pli
Member

@imjasonh
I tried the case, and the issue sometimes occurs.
The direct cause is in the TaskRun reconcile: in the retry case, after creating a new Pod, the reconciler tries to update the TaskRun at the end of reconcile but fails with

    Failed to update taskRun status, the object has been modified; please apply your changes to the latest version and try again

The TaskRun then stays in the workqueue and is reconciled again after an increased delay; since the pod name is still empty (the update failed last time), a new Pod is created.

I have not figured out why the status update error occurs, but I think the reconcile logic should be enhanced to avoid this problem.
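A minimal sketch of that failure pattern, using assumed stub types and helper names (Reconciler, createPod, updateStatus) rather than the actual Tekton reconciler code:

    package main

    import (
        "context"
        "errors"
        "fmt"
    )

    // Stub types standing in for the real Tekton/Kubernetes objects.
    type TaskRunStatus struct{ PodName string }
    type TaskRun struct{ Status TaskRunStatus }
    type Pod struct{ Name string }

    type Reconciler struct {
        createPod    func(context.Context, *TaskRun) (*Pod, error)
        updateStatus func(context.Context, *TaskRun) error
    }

    // reconcile creates a Pod whenever the status has no Pod name recorded,
    // then persists the status. If the status update fails with a conflict,
    // the key is requeued and the next reconcile still sees an empty PodName,
    // so it creates another Pod.
    func (r *Reconciler) reconcile(ctx context.Context, tr *TaskRun) error {
        if tr.Status.PodName == "" {
            pod, err := r.createPod(ctx, tr)
            if err != nil {
                return err
            }
            tr.Status.PodName = pod.Name
        }
        return r.updateStatus(ctx, tr) // a conflict here means PodName is never persisted
    }

    func main() {
        calls := 0
        r := &Reconciler{
            createPod: func(context.Context, *TaskRun) (*Pod, error) {
                calls++
                return &Pod{Name: fmt.Sprintf("retry-me-pod-%d", calls)}, nil
            },
            updateStatus: func(context.Context, *TaskRun) error {
                return errors.New("the object has been modified; please apply your changes to the latest version and try again")
            },
        }
        // First reconcile: Pod created, status update fails, key is requeued.
        _ = r.reconcile(context.Background(), &TaskRun{})
        // The requeued reconcile reads a stale TaskRun (PodName was never saved),
        // so a second Pod is created for the same attempt.
        _ = r.reconcile(context.Background(), &TaskRun{})
        fmt.Println("pods created:", calls) // pods created: 2
    }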

BTW, the test case is not correct; please check the PR.

@bobcatfish
Collaborator

Fixed by #1996

@ghost

ghost commented Feb 6, 2020

Hey, sorry @bobcatfish, that PR only updates the test to more accurately reflect the incorrect behavior described by the bug. The bug still exists :S

@vincent-pli
Member

This issue could be closed.

@vdemeester
Member

Indeed #2022 fixed it
/close

@tekton-robot
Collaborator

@vdemeester: Closing this issue.

In response to this:

Indeed #2022 fixed it
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
