Bug 1873288: server: Target the spec configuration if we have at least one node #2035
Conversation
(only compile tested)
Also see #1619 that touched on this a bit.
The CI cluster hit an issue where a pull secret was broken, and then we hit a deadlock because the MCO failed to drain nodes on the old config, because other nodes on the old config couldn't schedule the pod. It just generally makes sense for new nodes to use the new config; do so as long as at least one node has successfully joined the cluster at that config. This way we still avoid breaking the cluster (and scaleup) with a bad config.
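For reference, the serving decision described here can be sketched roughly as follows. This is a minimal, self-contained Go sketch, not the actual machine-config-server code: `poolState` and `pickServedConfig` are hypothetical stand-ins for the MachineConfigPool spec/status fields the server would consult (the desired and last rolled-out rendered config names, plus a count of nodes already updated to the desired config).

```go
package main

import "fmt"

// poolState is a hypothetical, simplified stand-in for the pool fields the
// server would read; it is not the real MachineConfigPool type.
type poolState struct {
	SpecConfig          string // desired (spec) rendered config
	StatusConfig        string // last fully rolled-out (status) rendered config
	UpdatedMachineCount int32  // nodes already running SpecConfig
}

// pickServedConfig returns the rendered config a newly booting node should be
// served: the spec config once at least one existing node has successfully
// updated to it, otherwise the last known-good status config. This keeps a
// completely broken new config from being handed to fresh nodes.
func pickServedConfig(p poolState) string {
	if p.UpdatedMachineCount > 0 {
		return p.SpecConfig
	}
	return p.StatusConfig
}

func main() {
	midUpgrade := poolState{
		SpecConfig:          "rendered-worker-new",
		StatusConfig:        "rendered-worker-old",
		UpdatedMachineCount: 1,
	}
	fmt.Println(pickServedConfig(midUpgrade)) // prints: rendered-worker-new
}
```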
Force-pushed from ee0eedf to 4bd204d.
It looks like actually in https://bugzilla.redhat.com/show_bug.cgi?id=1873288 we didn't have any nodes successfully on the new config, so this wouldn't have helped - but I think that's mostly bad luck: the new config could have rolled out, but we just happened to pick a node that couldn't drain. Perhaps the other alternative of a
@cgwalters: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Now this PR also would have greatly mitigated the issue we hit in #2167. As it is right now, scaling up during an upgrade basically just makes things slower and worse because:
With this instead, the flow would be what you'd expect:
And now the node is ready to take on workloads that need migration from previously existing nodes.
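To make the scale-up contrast concrete, here is a hypothetical illustration (plain Go, not MCO code) of the two flows a freshly provisioned node can take during an upgrade, depending on which rendered config it is served at first boot; the step strings are assumptions summarizing the behavior described in this thread.

```go
package main

import "fmt"

// scaleUpDuringUpgrade lists the steps a new node goes through mid-upgrade,
// depending on whether the server hands it the new (spec) config or the old
// (status) config at first boot. Purely illustrative.
func scaleUpDuringUpgrade(serveSpecConfig bool) []string {
	steps := []string{"node boots and requests its Ignition config"}
	if serveSpecConfig {
		// With this PR: the node comes up already on the target config.
		steps = append(steps,
			"node joins already on the new (spec) config",
			"node can immediately absorb workloads drained from old-config nodes")
	} else {
		// Previous behavior: the brand-new node itself needs an update cycle.
		steps = append(steps,
			"node joins on the old (status) config",
			"MCO must cordon, drain, and reboot the brand-new node to update it",
			"only afterwards can it absorb workloads from other draining nodes")
	}
	return steps
}

func main() {
	for _, serveSpec := range []bool{false, true} {
		fmt.Printf("serve spec config = %v:\n", serveSpec)
		for _, s := range scaleUpDuringUpgrade(serveSpec) {
			fmt.Println("  -", s)
		}
	}
}
```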
@cgwalters: This pull request references Bugzilla bug 1873288, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This, related to the BZ just attached, is very much welcome in 4.7 if we can fix the BZ and have something that works in those scenarios.
/retest
I am in favour of this approach. A use case is e.g. CI, where we dynamically spin up and down new nodes; upgrades could be stalled much longer since all new nodes would have to update (and drain workloads, which will take hours). Will do some manual testing.
lgtm. Will wait for Jerry's test result.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest Please review the full test history for this PR and help us cut down flakes.
@cgwalters: All pull requests linked via external trackers have merged: Bugzilla bug 1873288 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@cgwalters @yuqi-zhang any chance this could get backported to 4.6?
/cherrypick release-4.6
@cgwalters: new pull request created: #2225. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The CI cluster hit an issue where a pull secret was broken, and
then we hit a deadlock because the MCO failed to drain nodes on
the old config, because other nodes on the old config couldn't
schedule the pod.
It just generally makes sense for new nodes to use the new config;
do so as long as at least one node has successfully joined the
cluster at that config. This way we still avoid breaking
the cluster (and scaleup) with a bad config.
xref: https://bugzilla.redhat.com/show_bug.cgi?id=1873288