clear pod's .status.nominatedNodeName when necessary #106816
Conversation
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Huang-Wei
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
},
// Delete the node to simulate an ErrNoNodesAvailable error.
deleteNode: true,
podNamesToDelete: []string{"low"},
In a real cluster, this is actually optional.
However, in an integration test it may flake without this: a node deletion event wouldn't trigger moving the preemptor back to activeQ, so if the preemptor happens to be in backoff, the 30-second timeout may not be enough. So add a pod deletion here to actively trigger retrying the preemptor.
(force-pushed from 374db93 to 612b69b)
pkg/scheduler/scheduler.go
// No nodes available is counted as unschedulable rather than an error.
metrics.PodUnschedulable(fwk.ProfileName(), metrics.SinceInSeconds(start))
} else {
nominatedNode = pointer.StringPtr("")
in theory, this is the case of a transient error. Perhaps we shouldn't clear the nominated node here?
It's a bit debatable and depends on how we interpret this kind of error. If we believe it can be auto-healed, we shouldn't clear it; otherwise, clear it to prevent the pod from occupying resources.
I don't have a strong opinion as, personally, I don't have real-world data points to determine if it's transient and how transient it is... I'm giving a pessimistic judgment here :(
I also don't have a strong opinion. But I think that if the pod really needs preemption, it would get a nominated node in the next scheduling attempt. So it seems safer for utilization to clear.
pkg/scheduler/scheduler.go
sched.Error(podInfo, err)

// Update the scheduling queue with the nominated pod information. Without
// this, there would be a race condition between the next scheduling cycle
// and the time the scheduler receives a Pod Update for the nominated pod.
// Here we check for nil only for tests.
if sched.SchedulingQueue != nil {
sched.SchedulingQueue.AddNominatedPod(podInfo.PodInfo, nominatedNode)
var nnn string
It looks like nil and empty string have the same result here. Why use a pointer?
Ah, it's more about updatePod below.
However, it looks like AddNominatedPod looks at the existing nominatedNode in the status when you pass an empty string. This means that the nominated node name is not actually cleared from the cache.
Ah, it's more about updatePod below.
Yes.
However, it looks like AddNominatedPod looks at the existing nominatedNode in the status when you pass empty string.
Not quite. The entry in the internal nominator implementation will be cleared first in all cases:
kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go
Lines 834 to 845 in 4cc1abf
func (npm *nominator) add(pi *framework.PodInfo, nodeName string) {
// always delete the pod if it already exist, to ensure we never store more than
// one instance of the pod.
npm.delete(pi.Pod)
nnn := nodeName
if len(nnn) == 0 {
nnn = NominatedNodeName(pi.Pod)
if len(nnn) == 0 {
return
}
}
I think the problem is this line:
nnn = NominatedNodeName(pi.Pod)
(force-pushed from 83201fb to ba9c8c7)
@@ -692,7 +693,7 @@ func (npm *nominator) DeleteNominatedPodIfExists(pod *v1.Pod) {
// This is called during the preemption process after a node is nominated to run
// the pod. We update the structure before sending a request to update the pod
// object to avoid races with the following scheduling cycles.
-func (npm *nominator) AddNominatedPod(pi *framework.PodInfo, nodeName string) {
+func (npm *nominator) AddNominatedPod(pi *framework.PodInfo, nodeName *string) {
Document the semantics of nil vs empty string.
I updated to compare nil vs. non-nil. PTAL.
if len(nnn) == 0 {
return
}
if nodeName == nil {
so nil removes the nominated node name? Wasn't that supposed to be empty string?
Previously, if an empty string was passed in, we would always look at and honor the nominatedNodeName if the pod carries one. Now the logic gets tweaked:
- if you want to follow the old logic, on the caller side, use adaptiveNominatedNodeName() to always honor nominatedNodeName;
- if you want to enforce overriding, pass in a non-nil string pointer (whether it's empty or a concrete string).
I'm not liking adaptiveNominatedNodeName, as it could be forgotten. I think it should be handled within this function: if passing nil, leave; otherwise override.
if passing nil, leave; otherwise override
Passing the string pointer still needs to be decided at the caller side - whether it's a pod with an empty nominatedNodeName that we interpret as nil, or a caller that has built the string pointer properly (like recordSchedulingFailure()).
If we handle the adaptiveNominatedNodeName logic inside the function, and a user just passes in pod.status.nominatedNodeName, we don't have enough clues inside the function to know the exact intent.
Here is my proposal:
- nil means do not change anything. Then, we would query the nominatedNodeName from the pod status.
- "" means clear.
- "foo" is override.
I think that logic can be handled here.
It makes sense to me and can save effort on the caller side.
One question on "" means clear: to keep the previous behavior, "" and a regular string are treated the same. If we really want to clear the entry, we can do it in a follow-up.
I think distinguishing between nil and empty is asking for trouble. I also didn't like the old code that implicitly treated an empty nodeName as a fallback to the value in the pod. Lots of implicit behavior that one can't figure out without carefully parsing the code.
Not sure if this was the old code, but why not simply have the following:
// Call the following in the places where you are currently passing in nil.
add(pi) {
    addToNode(pi, pi.Pod.Status.NominatedNodeName)
}
// Always adds the pod to the passed nodeName
addToNode(pi, nodeName) {
    ...
}
Not sure if this was the old code, but why not simply have the following
This is basically the previous version, except that some places need to inject NominatedNodeName adaptively.
(force-pushed from 7d89863 to 6f80da5)
//
// - <PostFilterResult{}, Unschedulable>. It indicates the pod cannot be scheduled even with preemption.
// In this case, an empty PostFilterResult is returned to clear the pod's nominatedNodeName (if it carries).
// - <PostFilterResult{non-empty nominatedNodeName}, Unschedulable>. It's the regular happy path
this should be success, not unschedulable.
Typo.. Will update along with further changes.
// - <nil, Unschedulable>. This status is mostly as expected like the preemptor is waiting for the
// victims to be fully terminated.
// - In both cases above, a nil PostFilterResult is returned to keep the pod's nominatedNodeName unchanged.
//
// - <PostFilterResult{}, Unschedulable>. It indicates the pod cannot be scheduled even with preemption.
The distinction between nil and empty is an anti-pattern and confusing. I suggest two options:
- Let's add a field to PostFilterResult to explicitly express the distinction between nil and empty you are having now.
- Instead of returning Unschedulable, return UnschedulableAndUnresolvable in the case where we are returning an empty result.
I thought of option 1, but it's sort of non-trivial work. Let me think about it.
pkg/scheduler/framework/interface.go
// Notes for the semantics of `nodeName`:
// - if it's nil, the backing nominator aborts if the pod doesn't carry a nominatedNodeName; otherwise
// sets the carried nominatedNodeName.
// - if it's non-nil, the internal nominating bits will be updated with the given `nodeName`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-// - if it's non-nil, the internal nominating bits will be updated with the given `nodeName`.
+// - if it's non-nil, the internal cache will be updated with the given `nodeName`.
pkg/scheduler/framework/interface.go
// - if it's nil, the backing nominator aborts if the pod doesn't carry a nominatedNodeName; otherwise
// sets the carried nominatedNodeName.
// - if it's non-nil, the internal nominating bits will be updated with the given `nodeName`.
AddNominatedPod(pod *PodInfo, nodeName *string)
Overloading the parameter nodeName to signal more than simply the nodeName is problematic; I suggest making the interface explicit. Here are some options:
- Is there really a legitimate case where the nodeName is different from the one set in the passed pod if one is set? If so, I think we need to revisit that because I don't think there should be one.
- If we fail with option 1, I think this function should have consistent semantics and always add the pod to the passed nodeName, no exceptions, failing on an empty nodeName. The caller of this function can use a wrapper function as follows (not part of this interface):
func AddNominatedPodIfSet(pi) {
    if pi.Pod.Status.NominatedNodeName != "" {
        AddNominatedPod(pi, pi.Pod.Status.NominatedNodeName)
    }
}
Re 1: yes, it can be empty in recordSchedulingFailure() to clear the entry, or in UpdateNominatedPod(), which covers a special case: right after a preemptor is marked as nominated in memory, a podUpdate event (one that doesn't carry nominatedNodeName) comes in.
Re 2: the wrapper can help refactor the code for better readability. Thanks.
I appended a commit to use
(force-pushed from 692943e to 004df0e)
// PostFilterResult wraps needed info for scheduler framework to act upon PostFilter phase.
type PostFilterResult struct {
NominatedNodeName string
*NominatingInfo
Does it need to be a pointer?
Not mandatory, but it can simplify the code on the caller side (just pass in nil to indicate its mode is Noop).
I think that's good when passing NominatingInfo to a function. But when returning a Noop result, we are just using PostFilterResult=nil, no?
But when returning a Noop result, we are just using PostFilterResult=nil, no?
A Noop result equals a nil NominatingInfo, but is not quite equivalent to PostFilterResult, as NominatingInfo is a sub-field of PostFilterResult.
What I'm saying is that you are already using PostFilterResult=nil to signal Noop, so I don't see much point in having a pointer here. However, it's such a minor detail that I don't feel strongly about it.
I see. Given a pointer NominatingInfo can be carried over directly to other places (like recordSchedulingFailure()) that need a pointer, I'm inclined to keep it as-is.
pkg/scheduler/framework/plugins/defaultpreemption/default_preemption_test.go
// We won't fall into below `if` block if the Update event represents:
// (1) NominatedNode info is added
// (2) NominatedNode info is updated
// (3) NominatedNode info is removed
if NominatedNodeName(oldPod) == "" && NominatedNodeName(newPodInfo.Pod) == "" {
if nnn, ok := npm.nominatedPodToNode[oldPod.UID]; ok {
// This is the only case we should continue reserving the NominatedNode
nodeName = nnn
nominatingInfo = &framework.NominatingInfo{
Use the New function
I didn't add a New function for NominatingInfo. NewPostFilterResultWithNominatedNode is for simplifying instantiation of PostFilterResult.
cs := testCtx.ClientSet
type alwaysFail struct{}
Add a comment for the functionality of this plugin
done.
podNamesToDelete []string

// Register dummy plugin to simulate particular scheduling failures. Optional.
outOfTreePlugins *v1beta3.Plugins
customPlugins
if err != nil {
t.Errorf("Error getting the medium pod: %v", err)
}
if len(pod.Status.NominatedNodeName) == 0 {
should we have test cases for when the nnn is not cleared?
Alternatively, instead of special-casing "medium", is it worth having a map of the expected nnn or assigned nodes for all the pods not removed? Then you can do a cmp.Diff.
probably, but I won't bother here until we have a valid requirement in the future.
should we have test cases for when the nnn is not cleared?
We have an e2e test covering this, like PreemptionExecutionPath. But I'm going to add an integration test for this specifically.
I did add a new integration test, but it needs to consistently verify that the pod's nominatedNodeName is NOT cleared. That adds 15 seconds * 2 to the test time (the first check ensures an upcoming low-priority pod doesn't get scheduled; the second check ensures the preemptor's nnn is not cleared).
Uhm... sad that there is no way to know if there was a scheduling attempt that includes all the pod/node updates.
Maybe we shouldn't commit a test that is expected to stay idle for 15 seconds?
That leaves us with an untested path, but we still have the e2e and unit tests. WDYT? Sorry for the back and forth.
Maybe we shouldn't commit a test that is expected to stay idle for 15 seconds?
That's my concern as well.
That leaves us with an untested path, but we still have the e2e and unit tests. WDYT?
I think we can remove this integration test. The e2e test PreemptionExecutionPath ensures the same path coverage.
(force-pushed from 004df0e to 7677515)
err := wait.Poll(100*time.Millisecond, 15*time.Second, func() (bool, error) {
pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), "medium", metav1.GetOptions{})
if err != nil {
// t.Errorf("Error getting the medium pod: %v", err)
leftover?
reverted the most recent integration test and this should be gone.
(force-pushed from d59ebfb to 2433b08)
/lgtm
/lgtm thanks Wei, the explicit API for how to deal with nominatedNodeName feels more readable and easier to reason about.
/retest Thanks @alculquicondor and @ahg-g for pushing me on the readability refactoring :) Given this is going to benefit CA (#85677) and scheduler plugins (#106780), I'm going to backport this fix if no objections.
SGTM
Interesting, the log (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106816/pull-kubernetes-integration/1471554622542843904) says all tests passed but the CI job failed. /retest
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
/triage accepted
@Huang-Wei @ahg-g this fix is only needed on the release-1.23 branch? not in 1.22/1.21/1.20? thanks
I think 1.22 and 1.21 would benefit - https://github.com/kubernetes-sigs/scheduler-plugins/releases is affected by this on 1.21.
I find the fix too large to patch back. btw, I think this issue existed for a long time, it is not a new bug.
you mean, to patch back beyond 1.23, right? +1 to just backporting to 1.23
yeah, the patch to 1.23 should probably be automatic, but beyond that it will likely face conflicts.
thanks for the update and clarifications
…06816-upstream-release-1.23 Automated cherry pick of #106816: clear pod's .status.nominatedNodeName when necessary
What type of PR is this?
/kind bug
/sig scheduling
What this PR does / why we need it:
Clear a pod's .status.nominatedNodeName when needed.
Which issue(s) this PR fixes:
Fixes #106780 #85677
Special notes for your reviewer:
We may consider back-porting this as it's a bug fix.
Does this PR introduce a user-facing change?