-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow imagePullBackOff
for the specified duration
#7666
Conversation
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM but I think we should use a duration type instead of just a number (that represent minutes). It's more "user-friendly" in my opinion.
config/config-defaults.yaml
Outdated
@@ -87,6 +87,10 @@ data: | |||
# no default-resolver-type is specified by default | |||
default-resolver-type: | |||
|
|||
# default-imagepullbackoff-timeout contains the default number of minutes to wait | |||
# before requeuing the TaskRun to retry | |||
# default-imagepullbackoff-timeout: "5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to use time values instead, like 1m
, 5m
, 10s
or 1h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we need to add "minutes" in the field name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at this too and thinking that in all cases I can think of K8s uses "seconds". I think that's simplest vs. supporting something fancier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have strong opinion either ways. I like the time values as it is more clear and flexible such that the timeout can be specified in seconds, minutes, hours, etc. But at the same time, it looses consistency with the existing taskRun timeout field which uses minutes
.
We have the time.Duration
implemented now, do we want to change that to adding minutes
in the field name or use seconds
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point -- we might want to keep our time units consistent across the board in the future and I think time.Duration
is truly better basis.
pkg/reconciler/taskrun/taskrun.go
Outdated
@@ -222,10 +224,30 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1.TaskRun) pkgrecon | |||
return nil | |||
} | |||
|
|||
func (c *Reconciler) checkPodFailed(tr *v1.TaskRun) (bool, v1.TaskRunReason, string) { | |||
func (c *Reconciler) checkPodFailed(tr *v1.TaskRun, ctx context.Context) (bool, v1.TaskRunReason, string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K8s API convention is that "context" should be the first parameter if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s even Go conventions 😇
5d7e065
to
78a1b3d
Compare
78a1b3d
to
940e91a
Compare
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
We have implemented imagePullBackOff as fail fast. The issue with this approach is, the node where the pod is scheduled often experiences registry rate limit. The image pull failure because of the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting for enough time until the cap is expired. Kubernetes can then successfully pull the image and bring the pod up. Introducing a default configuration to specify cluster level timeout to allow the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure. tektoncd#5987 tektoncd#7184 Signed-off-by: Priti Desai <pdesai@us.ibm.com> wait for a given duration in case of imagePullBackOff Signed-off-by: Priti Desai <pdesai@us.ibm.com>
940e91a
to
9bf76db
Compare
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pritidesai
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JeromeJu, skaegi, vdemeester The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
We have implemented imagePullBackOff as fail fast. The issue with this approach is, the node where the pod is scheduled often experiences registry rate limit. The image pull failure because of the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting for enough time until the cap is expired. Kubernetes can then successfully pull the image and bring the pod up. Introducing a default configuration to specify cluster level timeout to allow the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure. tektoncd#5987 tektoncd#7184 This is a manual cheery-pick of tektoncd#7666 Signed-off-by: Priti Desai <pdesai@us.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pritidesai
Just a couple of minor things, maybe for a follow up PR
name: config-defaults | ||
namespace: tekton-pipelines | ||
data: | ||
default-imagepullbackoff-timeout: "5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example needs to be updated to 5m
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @afrittoli - ptal - #7679. Thanks!
if imagePullBackOffTimeOut.Seconds() != 0 { | ||
p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{}) | ||
if err != nil { | ||
message := fmt.Sprintf(`The step %q in TaskRun %q failed to pull the image %q and the pod with error: "%s."`, step.Name, tr.Name, step.ImageID, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: error messages should start in lowercase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @afrittoli - ptal - #7679. Thanks!
if sidecar.Waiting.Reason == ImagePullBackOff { | ||
imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout | ||
// only attempt to recover from the imagePullBackOff if specified | ||
if imagePullBackOffTimeOut.Seconds() != 0 { | ||
p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{}) | ||
if err != nil { | ||
message := fmt.Sprintf(`The sidecar %q in TaskRun %q failed to pull the image %q and the pod with error: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, err) | ||
return true, v1.TaskRunReasonImagePullFailed, message | ||
} | ||
for _, condition := range p.Status.Conditions { | ||
// check the pod condition to get the time when the pod was scheduled | ||
// keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration | ||
if condition.Type == corev1.PodScheduled { | ||
if c.Clock.Since(condition.LastTransitionTime.Time) < imagePullBackOffTimeOut { | ||
return false, "", "" | ||
} | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: this might be a reusable function instead of having the same code twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a minor difference, both operating on different datatype. Its a little tricky as one has reference to step while other has a reference to sidecar.
Thank you @vdemeester @skaegi @JeromeJu @afrittoli for the reviews! We just upgraded our deployment to 0.53 and would like to cherry pick this in 0.53 and 0.56. Thoughts? |
/cherry-pick release-v0.56.x |
@pritidesai: new pull request created: #7678 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We have implemented imagePullBackOff as fail fast. The issue with this approach is, the node where the pod is scheduled often experiences registry rate limit. The image pull failure because of the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting for enough time until the cap is expired. Kubernetes can then successfully pull the image and bring the pod up. Introducing a default configuration to specify cluster level timeout to allow the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure. tektoncd#5987 tektoncd#7184 This is a manual cheery-pick of tektoncd#7666 Signed-off-by: Priti Desai <pdesai@us.ibm.com>
We have implemented imagePullBackOff as fail fast. The issue with this approach is, the node where the pod is scheduled often experiences registry rate limit. The image pull failure because of the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting for enough time until the cap is expired. Kubernetes can then successfully pull the image and bring the pod up. Introducing a default configuration to specify cluster level timeout to allow the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure. #5987 #7184 This is a manual cheery-pick of #7666 Signed-off-by: Priti Desai <pdesai@us.ibm.com>
Changes
We have implemented
imagePullBackOff
to fail fast. The issue with this approach is, this can be a transient error depending on the infrastructure. Often times the node where thepod
is scheduled experiences registry rate limit. The image pull failure because of the rate limit returns the same warning (reason:Failed
and message:ImagePullBackOff
) compared to other authentication failure, missing image, etc. In case of a rate limit, the pod can potentially recover after waiting for enough time until the cap is expired. Kubernetes can then successfully pull the image and bring the pod up. But the fail fast approach results in ataskRun
failure and hencepipelineRun
results in a failure.Introducing a default configuration to specify cluster level timeout to allow the
imagePullBackOff
to retry for a given duration. Once that duration has passed, controller returns a permanent failure.#5987
#7184
/kind feature
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes