Allow imagePullBackOff for the specified duration #7666

Merged
merged 1 commit into from
Feb 15, 2024

Conversation

pritidesai
Member

@pritidesai pritidesai commented Feb 14, 2024

Changes

We have implemented imagePullBackOff to fail fast. The problem with this approach is that the error can be transient, depending on the infrastructure. Often the node where the pod is scheduled hits a registry rate limit. An image pull failure caused by a rate limit returns the same warning (reason: Failed and message: ImagePullBackOff) as other failures such as an authentication failure or a missing image. In the rate-limit case, the pod can potentially recover once it has waited long enough for the cap to expire. Kubernetes can then successfully pull the image and bring the pod up. With the fail-fast approach, however, the taskRun fails, and so does the pipelineRun.

This PR introduces a default configuration option that sets a cluster-level timeout during which imagePullBackOff is allowed to retry. Once that duration has passed, the controller returns a permanent failure.

#5987
#7184

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Configure default-imagepullbackoff-timeout to allow imagePullBackOff to retry and wait for the specified duration before failing the pipeline.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 14, 2024
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 14, 2024
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                               Old Coverage  New Coverage  Delta
pkg/apis/config/default.go         91.2%         85.5%         -5.7
pkg/reconciler/taskrun/taskrun.go  86.6%         83.9%         -2.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File                               Old Coverage  New Coverage  Delta
pkg/apis/config/default.go         91.2%         85.5%         -5.7
pkg/reconciler/taskrun/taskrun.go  86.6%         83.9%         -2.7

Member

@vdemeester vdemeester left a comment

SGTM, but I think we should use a duration type instead of just a number (that represents minutes). It's more "user-friendly" in my opinion.

@@ -87,6 +87,10 @@ data:
# no default-resolver-type is specified by default
default-resolver-type:

# default-imagepullbackoff-timeout contains the default number of minutes to wait
# before requeuing the TaskRun to retry
# default-imagepullbackoff-timeout: "5"
Member

We probably want to use time values instead, like 1m, 5m, 10s or 1h.

Member

or we need to add "minutes" in the field name.

Contributor

I was looking at this too and thinking that in all cases I can think of K8s uses "seconds". I think that's simplest vs. supporting something fancier.

Member Author

I do not have a strong opinion either way. I like the time values because they are clearer and more flexible: the timeout can be specified in seconds, minutes, hours, etc. At the same time, that loses consistency with the existing taskRun timeout field, which uses minutes.

We have time.Duration implemented now. Do we want to change that to adding "minutes" to the field name, or to using seconds?

Contributor

Good point. We might want to keep our time units consistent across the board in the future, and I think time.Duration is truly the better basis.

pkg/apis/config/default.go
@afrittoli afrittoli self-assigned this Feb 14, 2024
@@ -222,10 +224,30 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1.TaskRun) pkgrecon
return nil
}

-func (c *Reconciler) checkPodFailed(tr *v1.TaskRun) (bool, v1.TaskRunReason, string) {
+func (c *Reconciler) checkPodFailed(tr *v1.TaskRun, ctx context.Context) (bool, v1.TaskRunReason, string) {
Contributor

K8s API convention is that "context" should be the first parameter if needed.

Member

It’s even a Go convention 😇

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 14, 2024
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File                               Old Coverage  New Coverage  Delta
pkg/apis/config/default.go         91.2%         91.9%         0.7
pkg/reconciler/taskrun/taskrun.go  86.6%         87.2%         0.6

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                               Old Coverage  New Coverage  Delta
pkg/apis/config/default.go         91.2%         91.9%         0.7
pkg/reconciler/taskrun/taskrun.go  86.6%         87.2%         0.6

We have implemented imagePullBackOff as fail fast. The issue with this approach
is that the node where the pod is scheduled often hits a registry rate limit.
An image pull failure caused by the rate limit returns the same warning
(reason: Failed and message: ImagePullBackOff). The pod can potentially recover
after waiting long enough for the cap to expire. Kubernetes can then
successfully pull the image and bring the pod up.

Introduce a default configuration to specify a cluster-level timeout that
allows imagePullBackOff to retry for a given duration. Once that duration has
passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

Signed-off-by: Priti Desai <pdesai@us.ibm.com>

wait for a given duration in case of imagePullBackOff

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
@JeromeJu JeromeJu self-assigned this Feb 14, 2024
@pritidesai pritidesai added this to the Pipelines v0.57 milestone Feb 15, 2024
@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2024
Member

@JeromeJu JeromeJu left a comment

Thanks @pritidesai

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeromeJu, skaegi, vdemeester

Needs approval from an approver in each of these files:
  • OWNERS [JeromeJu,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JeromeJu
Member

/lgtm
Thanks for supporting this @pritidesai

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2024
pritidesai added a commit to pritidesai/pipeline that referenced this pull request Feb 15, 2024
This is a manual cherry-pick of tektoncd#7666

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
@tekton-robot tekton-robot merged commit fd17c74 into tektoncd:main Feb 15, 2024
13 checks passed
Member

@afrittoli afrittoli left a comment

Thanks @pritidesai
Just a couple of minor things, maybe for a follow-up PR.

name: config-defaults
namespace: tekton-pipelines
data:
default-imagepullbackoff-timeout: "5"
Member

The example needs to be updated to 5m as well

Member Author

thanks @afrittoli - ptal - #7679. Thanks!

if imagePullBackOffTimeOut.Seconds() != 0 {
	p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
	if err != nil {
		message := fmt.Sprintf(`The step %q in TaskRun %q failed to pull the image %q and the pod with error: "%s."`, step.Name, tr.Name, step.ImageID, err)
Member

NIT: error messages should start in lowercase

Member Author

thanks @afrittoli - ptal - #7679. Thanks!

Comment on lines +260 to +279
if sidecar.Waiting.Reason == ImagePullBackOff {
	imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
	// only attempt to recover from the imagePullBackOff if specified
	if imagePullBackOffTimeOut.Seconds() != 0 {
		p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
		if err != nil {
			message := fmt.Sprintf(`The sidecar %q in TaskRun %q failed to pull the image %q and the pod with error: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, err)
			return true, v1.TaskRunReasonImagePullFailed, message
		}
		for _, condition := range p.Status.Conditions {
			// check the pod condition to get the time when the pod was scheduled
			// keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
			if condition.Type == corev1.PodScheduled {
				if c.Clock.Since(condition.LastTransitionTime.Time) < imagePullBackOffTimeOut {
					return false, "", ""
				}
			}
		}
	}
}
Member

NIT: this might be a reusable function instead of having the same code twice

Member Author

There is a minor difference: the two blocks operate on different data types. It's a little tricky, as one has a reference to a step while the other has a reference to a sidecar.

@pritidesai
Member Author

Thank you @vdemeester @skaegi @JeromeJu @afrittoli for the reviews!

We just upgraded our deployment to 0.53 and would like to cherry-pick this into 0.53 and 0.56. Thoughts?

@pritidesai
Member Author

/cherry-pick release-v0.56.x

@tekton-robot
Collaborator

@pritidesai: new pull request created: #7678

In response to this:

/cherry-pick release-v0.56.x

tekton-robot pushed a commit that referenced this pull request Feb 26, 2024
This is a manual cherry-pick of #7666

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
6 participants