-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add InstanceExistsByProviderID to cloud provider interface for CCM #51087
Add InstanceExistsByProviderID to cloud provider interface for CCM #51087
Conversation
@prydie: GitHub didn't allow me to request PR reviews from the following users: jhorwit2. Note that only kubernetes members can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @prydie. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
pkg/cloudprovider/cloud.go
Outdated
InstanceExists(nodeName types.NodeName) (bool, error) | ||
// InstanceExistsByProviderID returns true if the instance for the given provider id still is running. | ||
// If false is returned with no error, the instance will be immediately deleted. | ||
// Note: The `cloudprovider.InstanceNotFound` error should not be returned here to notify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not? Please clarify the comment here, it's not obvious what you mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luxas do you think I should just omit that? I added it originally since people are use to using that error to denote the instance is gone, so I wanted to call out to not use that. However, calling it out in hindsight seems more confusing. Thoughts on just removing that note?
@@ -372,6 +382,18 @@ func excludeTaintFromList(taints []v1.Taint, toExclude v1.Taint) []v1.Taint { | |||
return newTaints | |||
} | |||
|
|||
func ensureNodeExistsByProviderIDOrName(instances cloudprovider.Instances, node *v1.Node) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment and/or unit test here please
a620a66
to
8fc4a7c
Compare
/retest |
Oh whoops my changes aren't being picked up... weird. |
Thanks for the PR. I reviewed it. The problem with this PR is the ByNameOrID pattern has been followed, but without a reason. The ByName method does not add any value. Aren't we all moving to addressing nodes by ProviderID? Woudln't it suffice to just add the ByProviderID method? Additionally, this is breaking existing functionality. Have you tested this change? Thirdly, We had the ByNameOrProviderID pattern to allow CCM to work without breaking existing functionality. If I merge this, CCM will stop working. |
@wlan0 This is the same pattern followed in the zones PR that you gave your stamp of approval on. Should we opt to remove The only existing functionality that would be broken is for the CCM, correct? That's the only place the |
@jhorwit2 The Zones PR required both methods. The GetZone() method did not take any parameter because it was always called from kubelet/always from the node which was querying the zone. There was no existing method for getting zones for a node from a controller either ByName or byID, and since providerID was not always supported, it made sense to have both methods there. That is why it got my stamp of approval. In this case, externalID is already a method that already exists for finding node existence by Name. Therefore it is unnecessary to add another one. |
@jhorwit2 Yes. This PR would break the functionality of the CCM. |
For what it's worth, this would definitely break DigitalOcean's CCM if I did a vendor update, but it would be acceptable since CCM is technically alpha still (right?). Another thing to consider though is that we'll be making it more difficult for existing cloud providers to migrate since we're adding more friction to integrate CCM (having to implement InstanceExists* methods). On the plus side, adding this change now will mean we can deprecate |
Zones technically breaks existing functionality as well just in a different way. Any existing clusters managing by the CCM will have nodes with a random failure zone/region value set by the current CCM implementations i've seen. The CCM has no way to correct that now with the updated Zone logic. I'm all for changing to only using provider id and moving to one method. I guess the point I'm trying to make is why support continue to If you want I'll change the current check to something like
so we keep using |
Yes, this needs to be fixed. The zones API does not add this bug though. Actually, It gets one step closer to fixing this. Also, its not a random zone/region that gets assigned. It is the zone in which the CCM is running.
It is not deprecated. Where did you get this information? |
Yes, and others. It is not acceptable to introduce breaking changes especially since we don't have all the cloudvendors ready to fill the voids that fix the breaks. It is always advisable to make non-breaking changes.
yes that is correct. Also, deprecation of ExternalID is not something that has been decided. Why are we all working under that assumption? |
@@ -1101,6 +1101,11 @@ func (c *Cloud) ExternalID(nodeName types.NodeName) (string, error) { | |||
return orEmpty(instance.InstanceId), nil | |||
} | |||
|
|||
// InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think it hurts to add comments from the cloud interface here for visibility
If false is returned with no error, the instance will be immediately deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
instances, _ := fc.Instances() | ||
exists, err := ensureNodeExistsByProviderIDOrName(instances, tc.node) | ||
if err != nil { | ||
t.Fatal(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, t.Fatal
is for when you fail to setup a test properly, t.Error
for when a test check fails
t.Fatalf("expected exist by provider id to be `%t` but got `%t`", tc.existsByProviderID, exists) | ||
} | ||
|
||
if tc.existsByNodeName && tc.existsByNodeName != exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tc.existsByNodeName != exists
is fine?
t.Fatalf("expected cloud provider methods `%v` to be called but `%v` was called ", tc.expectedCalls, fc.Calls) | ||
} | ||
|
||
if tc.existsByProviderID && tc.existsByProviderID != exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tc.existsByProviderID != exists
should be fine here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't work since you have two things to check to determine existence.
Test Case: if exists should be true because it exists by node name and not provider it, this test would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see a previous way people were testing these "by provider id or node name" methods, so if you have a better idea i'm all ears :)
t.Fatalf("expected exist by node name to be `%t` but got `%t`", tc.existsByNodeName, exists) | ||
} | ||
|
||
if !tc.existsByNodeName && !tc.existsByProviderID && exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check seems unesscary, if tc.existsByNodeName
and tc.existsByProviderID
are properly set then the previous checks should have failed the test?
ping me when all other reviewers' comments are addressed please |
@luxas i've addressed all the comments. |
Assign me AFTER it is lgtm'ed :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small-ish comments and clarifications
On a high level, LGTM though
@@ -1101,6 +1101,12 @@ func (c *Cloud) ExternalID(nodeName types.NodeName) (string, error) { | |||
return orEmpty(instance.InstanceId), nil | |||
} | |||
|
|||
// InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is running. | |||
// If false is returned with no error, the instance will be immediately deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... by the cloud controller manager?
}(node.Name) | ||
} | ||
glog.Errorf("Error getting node data from cloud: %v", err) | ||
exists, err := ensureNodeExistsByProviderIDOrExternalID(instances, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we deprecated the ExternalID name everywhere, why not ensureNodeExistsByProviderIDOrNodeName
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) that's what I had before, but changed it based on @wlan0's feedback.
exists, err := instances.InstanceExistsByProviderID(node.Spec.ProviderID) | ||
if err != nil { | ||
providerIDErr := err | ||
_, err = instances.ExternalID(types.NodeName(node.Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not instances.NodeName
like we talked about?
Isn't that implemented yet? (Did it stall out in a PR somewhere?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall us discussing that. Mind filling me in??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we landed on keeping the ExistsByProviderID & ExternalID checks. Then in the future we can add a ExistsByNodeName if we want to, but avoid adding that method now since it's a duplicate of ExternalID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah, never mind 😄
That was the case indeed, recalled incorrectly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/assign @thockin
/retest |
bazel flake: #51379 |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: prydie, thockin Associated issue: 50985 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/test pull-kubernetes-e2e-gce-bazel |
/test pull-kubernetes-e2e-gce-etcd3
… On Aug 25, 2017, at 8:16 PM, k8s-ci-robot ***@***.***> wrote:
@prydie: The following tests failed, say /retest to rerun them all:
Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-bazel cf75c49 link /test pull-kubernetes-e2e-gce-bazel
pull-kubernetes-e2e-gce-etcd3 cf75c49 link /test pull-kubernetes-e2e-gce-etcd3
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Automatic merge from submit-queue (batch tested with PRs 51174, 51363, 51087, 51382, 51388) |
…cm-instance-exists Automatic merge from submit-queue (batch tested with PRs 51174, 51363, 51087, 51382, 51388) Add InstanceExistsByProviderID to cloud provider interface for CCM **What this PR does / why we need it**: Currently, [`MonitorNode()`](https://github.com/kubernetes/kubernetes/blob/02b520f0a40be2056d91fc0661c2b4fdb2694c30/pkg/controller/cloud/nodecontroller.go#L240) in the node controller checks with the CCM if a node still exists by calling `ExternalID(nodeName)`. `ExternalID` is supposed to return the provider id of a node which is not supported on every cloud. This means that any clouds who cannot infer the provider id by the node name from a remote location will never remove nodes that no longer exist. **Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes#50985 **Special notes for your reviewer**: We'll want to create a subsequent issue to track the implementation of these two new methods in the cloud providers. **Release note**: ```release-note Adds `InstanceExists` and `InstanceExistsByProviderID` to cloud provider interface for the cloud controller manager ``` /cc @wlan0 @thockin @andrewsykim @luxas @jhorwit2 /area cloudprovider /sig cluster-lifecycle
What this PR does / why we need it:
Currently,
MonitorNode()
in the node controller checks with the CCM if a node still exists by callingExternalID(nodeName)
.ExternalID
is supposed to return the provider id of a node which is not supported on every cloud. This means that any clouds who cannot infer the provider id by the node name from a remote location will never remove nodes that no longer exist.Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged): fixes #50985Special notes for your reviewer:
We'll want to create a subsequent issue to track the implementation of these two new methods in the cloud providers.
Release note:
/cc @wlan0 @thockin @andrewsykim @luxas @jhorwit2
/area cloudprovider
/sig cluster-lifecycle