Timed out waiting for all machines to be exist #8824
Comments
/triage accepted
This seems to be the same flake: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#a20c32c92add5bfec5f5
/help
@killianmuldoon: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm going to try to take a look into this.
/assign
Just looked at one of the cases. I think there's a realistic chance this is the same issue as here: #8786 (comment) (can be verified by looking for preflight errors in the MachineSet and then checking if KCP has a status version) |
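For reference, a rough sketch of how that check could be done with a controller-runtime client against the management cluster. This is not part of the test framework; the namespace, object names, and package name are made up for this example, and the exact condition that carries the preflight error may differ:

```go
// Hedged sketch: look for preflight-check errors in the MachineSet conditions
// and check whether the KubeadmControlPlane already reports a status.version.
package triage

import (
	"context"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func checkPreflightVsKCPVersion(ctx context.Context, c client.Client) error {
	// Placeholder names: adjust namespace/name to the cluster under test.
	ms := &clusterv1.MachineSet{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: "default", Name: "example-md-0"}, ms); err != nil {
		return err
	}
	// Preflight-check failures should surface in the MachineSet's conditions.
	for _, cond := range ms.Status.Conditions {
		fmt.Printf("MachineSet condition %s=%s: %s\n", cond.Type, cond.Status, cond.Message)
	}

	kcp := &controlplanev1.KubeadmControlPlane{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: "default", Name: "example-control-plane"}, kcp); err != nil {
		return err
	}
	// An empty status.version would match the behaviour discussed in #8786.
	if kcp.Status.Version != nil {
		fmt.Printf("KCP status.version: %s\n", *kcp.Status.Version)
	} else {
		fmt.Println("KCP has no status.version yet")
	}
	return nil
}
```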
Note: this could be a different issue than #8786; the query linked above may lead to different issues. Analysing the prowjob linked in the first post, it looks like the control plane container is not able to start and CAPD's container creation/start does not work:
The same information appears in the docker log.
Updated link from the first comment: https://storage.googleapis.com/k8s-triage/index.html?date=2023-06-10&job=.*-cluster-api-.*&xjob=.*-provider-.*#2822326a66dd24850a9d
An additional, more flexible query to find the issue independent of the query id:
So this issue here is:
I assume the port? Ah, interesting. Am I seeing correctly that we hand over 0 as a host port to Docker?
This would suggest that Docker itself should pick a random port (?) (maybe I'm looking at the wrong code). That behaviour can be seen in isolation with the sketch below.
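Not CAPD's actual code, but a small standalone sketch of the behaviour in question: publishing a container port with host port 0 makes the Docker daemon pick a free ephemeral port, which can then be read back with docker port. The image and container name are arbitrary placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Host port 0 -> Docker selects a random free port on 127.0.0.1.
	out, err := exec.Command("docker", "run", "-d",
		"--name", "port-zero-demo",
		"-p", "127.0.0.1:0:6443/tcp",
		"registry.k8s.io/pause:3.9").Output()
	if err != nil {
		fmt.Println("docker run failed:", err)
		return
	}
	containerID := strings.TrimSpace(string(out))
	// Always clean up the demo container.
	defer exec.Command("docker", "rm", "-f", containerID).Run()

	// Ask Docker which host port it actually assigned for 6443/tcp.
	binding, err := exec.Command("docker", "port", containerID, "6443/tcp").Output()
	if err != nil {
		fmt.Println("docker port failed:", err)
		return
	}
	fmt.Printf("container %s: 6443/tcp bound to %s", containerID[:12], binding)
}
```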
Assuming I'm looking at the right code, I wonder if we should just implement a retry (e.g. via requeue) and be done with it :) (+ surface in the logs that we're retrying). P.S. Given that we just fixed #8786, I'm not sure we have a clear signal right now for how often this specific issue occurs.
A simple requeue is not enough in this case; we also have to delete the container (roughly as sketched below). Sidenote: reproducible via:
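This is not the reproduction referenced above, but a hedged, controller-style sketch of the "delete, then requeue" remediation being discussed. The containerMachine interface and function below are made up for this example and are not CAPD's real API:

```go
// If creating/starting the backing container fails (e.g. because the ephemeral
// host port clashed), delete the partially created container before
// requeueing, so the next reconcile starts from a clean slate.
package remediation

import (
	"context"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
	ctrl "sigs.k8s.io/controller-runtime"
)

// containerMachine is a stand-in for CAPD's abstraction around the backing
// Docker container of a DockerMachine.
type containerMachine interface {
	Create(ctx context.Context) error
	Delete(ctx context.Context) error
}

func reconcileContainer(ctx context.Context, m containerMachine) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	if err := m.Create(ctx); err != nil {
		// Surface the retry in the logs so the flake stays visible.
		log.Info("Container creation failed, deleting it and retrying", "err", err)

		// A bare requeue is not enough: the broken container still exists and
		// would block the next attempt, so delete it first.
		if deleteErr := m.Delete(ctx); deleteErr != nil {
			return ctrl.Result{}, kerrors.NewAggregate([]error{err, deleteErr})
		}
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, nil
}
```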
@killianmuldoon: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Link to check if the issue still exists on main (because the cherry-picks will only get merged after). The PR got merged; I'll postpone checking whether it is fixed on main until Wednesday, 16th August. This gives us 9 days to see if we got rid of the issue on main.
Note: after merging the cherry-picks, we should also cherry-pick #9139 on top.
I think we can close this now. If the same issue pops up we can take another look, but this error message is the result of a number of different possible underlying errors. Thanks again for fixing this @chrischdi!
/close
There was only one occurrence of this flake since then. However, that occurrence was during the clusterctl upgrade tests while CAPI v1.0.5 was running, so there has been no occurrence since merging the fix. The cherry-picks also got merged now:
/close
@killianmuldoon: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which jobs are flaking?
e.g. https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-3/1666214582626029568
Which tests are flaking?
Since when has it been flaking?
Minor flakes since 04-06-2023
Testgrid link
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#2822326a66dd24850a9d
Reason for failure (if possible)
To be analyzed.
Anything else we need to know?
No response
Label(s) to be applied
/kind flake