Sporadic failures with etcdserver: request timed out on Kind Prow tests #5118
Just want to register here that I have the same type of problem. In my case, in the middle of a pipeline execution, while tasks are still running, the PipelineRun status suddenly changes to "Failed" with the following message:
No tasks actually failed, and note that the task the message above complains about does exist (and had already been executed); the last part of the error message seems to be where the problem really is.
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@Yongxuanzhang I often encounter this error, and I find that the latest code in the main branch still has this issue. pipeline/pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go Lines 645 to 649 in 6cc2254
pipeline/pkg/reconciler/pipelinerun/pipelinerun.go Lines 364 to 374 in 6cc2254
pipeline/pkg/reconciler/taskrun/resources/taskref.go Lines 283 to 286 in 6cc2254
pipeline/pkg/reconciler/taskrun/resources/taskref.go Lines 40 to 44 in 6cc2254
Can we tolerate this temporary "request timed out" error within a certain time frame, allowing it to occur without immediately failing the run?
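(Editor's note: a minimal, self-contained sketch of the idea above, not Tekton's actual resolution code. The helper name `isTransientAPIError` is hypothetical; the classifiers are from `k8s.io/apimachinery/pkg/api/errors`, and the string match is a heuristic fallback because etcd timeouts often surface only in the error message.)

```go
package main

import (
	"fmt"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isTransientAPIError reports whether err looks like a temporary
// apiserver/etcd hiccup worth retrying on a later reconcile, rather than
// a definitive "the referenced Task does not exist" failure.
func isTransientAPIError(err error) bool {
	if err == nil {
		return false
	}
	// Structured checks for timeout / throttling style responses.
	if apierrors.IsServerTimeout(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err) ||
		apierrors.IsInternalError(err) {
		return true
	}
	// Heuristic fallback: match the etcd timeout text directly.
	return strings.Contains(err.Error(), "etcdserver: request timed out")
}

func main() {
	err := fmt.Errorf("failed to get task: etcdserver: request timed out")
	fmt.Println(isTransientAPIError(err)) // true
}
```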
🤔 I don't think I have changed anything related. Do you mean that we should not return this error for a certain time frame?
Yes, I'm aware that this part of the code has been like this since version 0.41.0; it wasn't changed recently by you.
Yes, we need to be able to tolerate this error for a period of time, but we cannot tolerate it indefinitely.
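(Editor's note: a minimal sketch of the "tolerate for a while, but not forever" idea, separate from the classification helper above. The names and the 5-minute grace window are assumptions for illustration, not values from the Tekton codebase.)

```go
package main

import (
	"fmt"
	"time"
)

// transientErrorGracePeriod is how long transient "request timed out"
// errors are tolerated before the run is marked failed (assumed value).
const transientErrorGracePeriod = 5 * time.Minute

// decision is what the caller should do with the run after a transient error.
type decision string

const (
	requeue decision = "requeue" // retry on a later reconcile
	fail    decision = "fail"    // give up and mark the run failed
)

// handleTransientError decides between retrying and failing, based on how
// long the run has already been waiting since it was created.
func handleTransientError(createdAt, now time.Time) decision {
	if now.Sub(createdAt) < transientErrorGracePeriod {
		return requeue
	}
	return fail
}

func main() {
	created := time.Now().Add(-2 * time.Minute)
	fmt.Println(handleTransientError(created, time.Now()))                       // requeue
	fmt.Println(handleTransientError(created.Add(-10*time.Minute), time.Now())) // fail
}
```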
Do you want to reopen this issue or open a new one?
This issue already contains a lot of context; how about we just reopen it? If it's just a matter of ignoring the timeout error for a while, I know how to deal with it.
An example is https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5077/pull-tekton-pipeline-kind-alpha-yaml-tests/1546485026143604736 (link will quite possibly go away).
kubernetes-sigs/kind#717 is a similar issue, which was resolved by adding a daemonset to their Prow cluster much like the one we added in tektoncd/plumbing#1129. kubernetes-sigs/kind#1922 seems relevant as well - which suggests the issue may be disk I/O-related. The person who opened that issue ended up moving kubevirt's Kind-in-Prow tooling to use in-memory etcd at kubevirt/kubevirtci@393d0cd, which could be helpful.
In general, I'm wondering if there are optimizations we can make to the local, non-persistent disks on our GKE nodes to improve IO...
/kind flake