
Sporadic failures with etcdserver: request timed out on Kind Prow tests #5118

Closed
abayer opened this issue Jul 11, 2022 · 12 comments
Labels
kind/flake Categorizes issue or PR as related to a flakey test
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@abayer
Contributor

abayer commented Jul 11, 2022

An example is https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5077/pull-tekton-pipeline-kind-alpha-yaml-tests/1546485026143604736 (link will quite possibly go away).

kubernetes-sigs/kind#717 is a similar issue, which was resolved by adding a daemonset to their Prow cluster much like the one we added in tektoncd/plumbing#1129. kubernetes-sigs/kind#1922 seems relevant as well - which suggests the issue may be disk I/O-related. The person who opened that issue ended up moving kubevirt's Kind-in-Prow tooling to use in-memory etcd at kubevirt/kubevirtci@393d0cd, which could be helpful.

In general, I'm wondering if there are optimizations we can make to the local, non-persistent disks on our GKE nodes to improve IO...
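
For reference, the in-memory etcd trick boils down to backing the control-plane node's etcd data directory with tmpfs. A rough sketch of what that could look like in a Kind cluster config (the /dev/shm/kind-etcd host path is just an illustrative assumption, not necessarily the exact change kubevirtci made):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  # Back etcd's data dir with a tmpfs-backed host path so etcd's writes and
  # fsyncs never touch the node's slow local disk (assumes the directory
  # exists on the host before cluster creation).
  - hostPath: /dev/shm/kind-etcd
    containerPath: /var/lib/etcd

The trade-off is that etcd state is lost if the node container restarts, which should be fine for throwaway CI clusters.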

/kind flake

@tekton-robot tekton-robot added the kind/flake label Jul 11, 2022
@alequint

Just want to register here that I'm seeing the same type of problem. In my case, in the middle of a pipeline execution, while tasks are still running, the PipelineRun status suddenly changes to "Failed" with the following message:

Pipeline mas-fvtairgap-pipelines/mas-fvt-pipeline can't be Run; it contains Tasks that don't exist: Couldn't retrieve Task "mas-devops-ibm-catalogs": etcdserver: request timed out

No tasks failed, and note that the task the message above complains about does exist (and had already been executed, by the way); the last part of the error message seems to be where the problem really is: etcdserver: request timed out. I'm wondering whether a lack of resources makes the etcd server unreachable, maybe local storage issues; I'm not sure.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale label Oct 18, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 17, 2022
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Repository owner moved this from Todo to Done in Tekton Community Roadmap Dec 17, 2022
@l-qing
Contributor

l-qing commented Dec 30, 2023

/reopen

@l-qing
Contributor

l-qing commented Dec 30, 2023

/remove-lifecycle rotten

@l-qing
Contributor

l-qing commented Dec 30, 2023

@Yongxuanzhang I often encounter this error, and I find that the latest code in the main branch still has this issue.
I've noticed that you've recently made some changes to the code here as well.

case err != nil:
    return rt, &TaskNotFoundError{
        Name: pipelineTask.TaskRef.Name,
        Msg:  err.Error(),
    }

if tresources.IsErrTransient(err) {
    return nil, err
}
if errors.Is(err, remote.ErrRequestInProgress) {
    return nil, err
}
var nfErr *resources.TaskNotFoundError
if errors.As(err, &nfErr) {
    pr.Status.MarkFailed(v1.PipelineRunReasonCouldntGetTask.String(),
        "Pipeline %s/%s can't be Run; it contains Tasks that don't exist: %s",
        pipelineMeta.Namespace, pipelineMeta.Name, nfErr)

// IsErrTransient returns true if an error returned by GetTask/GetStepAction is retryable.
func IsErrTransient(err error) bool {
    return strings.Contains(err.Error(), errEtcdLeaderChange)
}

// This error is defined in etcd at
// https://github.com/etcd-io/etcd/blob/5b226e0abf4100253c94bb71f47d6815877ed5a2/server/etcdserver/errors.go#L30
// TODO: If/when https://github.com/kubernetes/kubernetes/issues/106491 is addressed,
// we should stop relying on a hardcoded string.
var errEtcdLeaderChange = "etcdserver: leader changed"

Can we tolerate this transient "request timed out" error within a certain time frame, rather than failing the PipelineRun immediately?
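
For illustration, a minimal sketch of that idea, assuming we simply broaden the hardcoded-string match that IsErrTransient already does (the transientEtcdErrors slice is my own naming, not existing Tekton code):

package resources

import "strings"

// Hardcoded etcd error strings treated as transient. "etcdserver: leader changed"
// is what the current code matches; "etcdserver: request timed out" is the
// addition being discussed here.
var transientEtcdErrors = []string{
    "etcdserver: leader changed",
    "etcdserver: request timed out",
}

// IsErrTransient reports whether err looks like a transient etcd failure that
// should be returned for a requeue instead of marking the PipelineRun failed.
func IsErrTransient(err error) bool {
    if err == nil {
        return false
    }
    for _, msg := range transientEtcdErrors {
        if strings.Contains(err.Error(), msg) {
            return true
        }
    }
    return false
}

With that, the reconciler's existing IsErrTransient branch would return the error and requeue instead of falling through to the TaskNotFoundError path.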

@Yongxuanzhang
Member

Can we tolerate this transient "request timed out" error within a certain time frame, rather than failing the PipelineRun immediately?

🤔 I don't think I have changed anything related? Did you mean that we don't return this error for a time frame?

@l-qing
Contributor

l-qing commented Jan 3, 2024

🤔 I don't think I have changed anything related?

Yes, I'm aware that this part of the code has been like this since version 0.41.0; it wasn't changed recently by you.
I thought of your comments in other issues, so I mentioned you here. 😆
#7392 (comment)

Did you mean that we don't return this error for a time frame?

Yes. We need to be able to tolerate this error for a period of time, but we cannot tolerate it indefinitely.

@Yongxuanzhang
Member

🤔 I don't think I have changed anything related?

Yes, I'm aware that this part of the code has been like this since version 0.41.0; it wasn't changed recently by you. I thought of your comments in other issues, so I mentioned you here. 😆 #7392 (comment)

Did you mean that we don't return this error for a time frame?

Yes. We need to be able to tolerate this error for a period of time, but we cannot tolerate it indefinitely.

Do you want to reopen this issue or open a new one?

@l-qing
Contributor

l-qing commented Jan 4, 2024

Do you want to reopen this issue or open a new one?

This issue already contains a lot of context information, how about we just reopen it?

If it's just a matter of ignoring the timeout error, I know how to handle it.
But if there needs to be a time limit, I'm not sure whether there is any existing code in the current framework I can use as a reference.
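
If a time limit is needed, one rough sketch (the transientErrorGracePeriod constant and the way the start time gets plumbed in are hypothetical, not existing Tekton code) would be to keep returning the transient error for a requeue only while the run is still within a grace period, and otherwise fall through to the existing failure handling:

package reconciler

import (
    "strings"
    "time"
)

// transientErrorGracePeriod is a hypothetical knob: how long transient etcd
// errors are tolerated (and retried) before the PipelineRun is marked failed.
const transientErrorGracePeriod = 2 * time.Minute

// isTransientEtcdErr mirrors the string matching discussed above, covering both
// "leader changed" and "request timed out".
func isTransientEtcdErr(err error) bool {
    return err != nil &&
        (strings.Contains(err.Error(), "etcdserver: leader changed") ||
            strings.Contains(err.Error(), "etcdserver: request timed out"))
}

// shouldRetryTransient returns true when the error should be returned to the
// reconciler so the key is requeued, and false once the grace period since the
// run started has elapsed, at which point the existing TaskNotFoundError /
// MarkFailed path would apply.
func shouldRetryTransient(err error, startTime time.Time) bool {
    return isTransientEtcdErr(err) && time.Since(startTime) < transientErrorGracePeriod
}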
