
Sporadic failures with etcdserver: request timed out on Kind Prow tests #5118

Closed
abayer opened this issue Jul 11, 2022 · 12 comments
Labels
kind/flake Categorizes issue or PR as related to a flakey test
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@abayer
Contributor

abayer commented Jul 11, 2022

An example is https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5077/pull-tekton-pipeline-kind-alpha-yaml-tests/1546485026143604736 (link will quite possibly go away).

kubernetes-sigs/kind#717 is a similar issue, which was resolved by adding a daemonset to their Prow cluster much like the one we added in tektoncd/plumbing#1129. kubernetes-sigs/kind#1922 seems relevant as well - which suggests the issue may be disk I/O-related. The person who opened that issue ended up moving kubevirt's Kind-in-Prow tooling to use in-memory etcd at kubevirt/kubevirtci@393d0cd, which could be helpful.

In general, I'm wondering if there are optimizations we can make to the local, non-persistent disks on our GKE nodes to improve IO...
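
For reference, the in-memory etcd trick boils down to backing the control-plane node's etcd data directory with tmpfs. A rough sketch of what that could look like in a Kind cluster config (the /dev/shm/kind-etcd host path is just an illustrative assumption, not necessarily the exact change kubevirtci made):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  # Back etcd's data dir with a tmpfs-backed host path so etcd's writes and
  # fsyncs never touch the node's slow local disk (assumes the directory
  # exists on the host before cluster creation).
  - hostPath: /dev/shm/kind-etcd
    containerPath: /var/lib/etcd

The trade-off is that etcd state is lost if the node container restarts, which should be fine for throwaway CI clusters.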

/kind flake

@tekton-robot tekton-robot added the kind/flake label Jul 11, 2022
@alequint

Just want to register here that I'm seeing the same type of problem. In my case, in the middle of a pipeline execution, while tasks are still running, the PipelineRun status suddenly changes to "Failed" with the following message:

Pipeline mas-fvtairgap-pipelines/mas-fvt-pipeline can't be Run; it contains Tasks that don't exist: Couldn't retrieve Task "mas-devops-ibm-catalogs": etcdserver: request timed out

No tasks failed, and note that the task the message above complains about does exist (and had already been executed, by the way); the last part of the error message seems to be where the problem really is: etcdserver: request timed out. I'm wondering whether a lack of resources makes the etcd server unreachable, maybe local storage issues; I'm not sure.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale label Oct 18, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 17, 2022
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Repository owner moved this from Todo to Done in Tekton Community Roadmap Dec 17, 2022
@l-qing
Contributor

l-qing commented Dec 30, 2023

/reopen

@l-qing
Contributor

l-qing commented Dec 30, 2023

/remove-lifecycle rotten

@l-qing
Contributor

l-qing commented Dec 30, 2023

@Yongxuanzhang I often encounter this error, and I find that the latest code in the main branch still has this issue.
I've noticed that you've recently made some changes to the code here as well.

case err != nil:
    return rt, &TaskNotFoundError{
        Name: pipelineTask.TaskRef.Name,
        Msg:  err.Error(),
    }

if tresources.IsErrTransient(err) {
    return nil, err
}
if errors.Is(err, remote.ErrRequestInProgress) {
    return nil, err
}
var nfErr *resources.TaskNotFoundError
if errors.As(err, &nfErr) {
    pr.Status.MarkFailed(v1.PipelineRunReasonCouldntGetTask.String(),
        "Pipeline %s/%s can't be Run; it contains Tasks that don't exist: %s",
        pipelineMeta.Namespace, pipelineMeta.Name, nfErr)

// IsErrTransient returns true if an error returned by GetTask/GetStepAction is retryable.
func IsErrTransient(err error) bool {
    return strings.Contains(err.Error(), errEtcdLeaderChange)
}

// This error is defined in etcd at
// https://github.com/etcd-io/etcd/blob/5b226e0abf4100253c94bb71f47d6815877ed5a2/server/etcdserver/errors.go#L30
// TODO: If/when https://github.com/kubernetes/kubernetes/issues/106491 is addressed,
// we should stop relying on a hardcoded string.
var errEtcdLeaderChange = "etcdserver: leader changed"

Can we tolerate this transient "request timed out" error within a certain time frame, rather than failing the PipelineRun immediately?
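
For illustration, a minimal sketch of that idea, assuming we simply broaden the hardcoded-string match that IsErrTransient already does (the transientEtcdErrors slice is my own naming, not existing Tekton code):

package resources

import "strings"

// Hardcoded etcd error strings treated as transient. "etcdserver: leader changed"
// is what the current code matches; "etcdserver: request timed out" is the
// addition being discussed here.
var transientEtcdErrors = []string{
    "etcdserver: leader changed",
    "etcdserver: request timed out",
}

// IsErrTransient reports whether err looks like a transient etcd failure that
// should be returned for a requeue instead of marking the PipelineRun failed.
func IsErrTransient(err error) bool {
    if err == nil {
        return false
    }
    for _, msg := range transientEtcdErrors {
        if strings.Contains(err.Error(), msg) {
            return true
        }
    }
    return false
}

With that, the reconciler's existing IsErrTransient branch would return the error and requeue instead of falling through to the TaskNotFoundError path.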

@Yongxuanzhang
Member

Can we tolerate this transient "request timed out" error within a certain time frame, rather than failing the PipelineRun immediately?

🤔 I don't think I have changed anything related? Did you mean that we don't return this error for a time frame?

@l-qing
Contributor

l-qing commented Jan 3, 2024

🤔 I don't think I have changed anything related?

Yes, I'm aware that this part of the code has been like this since version 0.41.0; it wasn't changed recently by you.
I thought of your comments in other issues, so I mentioned you here. 😆
#7392 (comment)

Did you mean that we don't return this error for a time frame?

Yes. We need to be able to tolerate this error for a period of time, but we cannot tolerate it indefinitely.

@Yongxuanzhang
Member

🤔 I don't think I have changed anything related?

Yes, I'm aware that this part of the code has been like this since version 0.41.0; it wasn't changed recently by you. I thought of your comments in other issues, so I mentioned you here. 😆 #7392 (comment)

Did you mean that we don't return this error for a time frame?

Yes. We need to be able to tolerate this error for a period of time, but we cannot tolerate it indefinitely.

Do you want to reopen this issue or open a new one?

@l-qing
Contributor

l-qing commented Jan 4, 2024

Do you want to reopen this issue or open a new one?

This issue already contains a lot of context information, how about we just reopen it?

If it's just a matter of ignoring the timeout error, I know how to handle it.
But if there needs to be a time limit, I'm not sure whether there is any existing code in the current framework I can use as a reference.
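
If a time limit is needed, one rough sketch (the transientErrorGracePeriod constant and the way the start time gets plumbed in are hypothetical, not existing Tekton code) would be to keep returning the transient error for a requeue only while the run is still within a grace period, and otherwise fall through to the existing failure handling:

package reconciler

import (
    "strings"
    "time"
)

// transientErrorGracePeriod is a hypothetical knob: how long transient etcd
// errors are tolerated (and retried) before the PipelineRun is marked failed.
const transientErrorGracePeriod = 2 * time.Minute

// isTransientEtcdErr mirrors the string matching discussed above, covering both
// "leader changed" and "request timed out".
func isTransientEtcdErr(err error) bool {
    return err != nil &&
        (strings.Contains(err.Error(), "etcdserver: leader changed") ||
            strings.Contains(err.Error(), "etcdserver: request timed out"))
}

// shouldRetryTransient returns true when the error should be returned to the
// reconciler so the key is requeued, and false once the grace period since the
// run started has elapsed, at which point the existing TaskNotFoundError /
// MarkFailed path would apply.
func shouldRetryTransient(err error, startTime time.Time) bool {
    return isTransientEtcdErr(err) && time.Since(startTime) < transientErrorGracePeriod
}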
