Scheduler should terminate on loosing leader lock #81306
/sig scheduling
/priority important-soon
@@ -262,7 +261,7 @@ func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, regis
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			utilruntime.HandleError(fmt.Errorf("lost master"))
+			klog.Fatalf("leaderelection lost")
Looks good. Now we're consistent with our neighbours:
cmd/kube-controller-manager/app/controllermanager.go
281: OnStoppedLeading: func() {
282- klog.Fatalf("leaderelection lost")
cmd/kube-scheduler/app/server.go
264: OnStoppedLeading: func() {
265- utilruntime.HandleError(fmt.Errorf("lost master"))
cmd/cloud-controller-manager/app/controllermanager.go
209: OnStoppedLeading: func() {
210- klog.Fatalf("leaderelection lost")
/lgtm
/approve
Thanks, @ravisantoshgudimetla for fixing this important bug. We must backport this to older versions (1.13+) of K8s.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, ravisantoshgudimetla The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest
/hold
Sorry to bother you, could you please provide some real cases and how to reproduce this issue?
Logs are below (ignore the unsupported 1.10 version; this snippet hasn't changed since then):
Everything seems to work fine. kubernetes/cmd/kube-scheduler/app/server.go Lines 261 to 275 in 890b50f
If we want to be consistent with our neighbors, we can merge this, but it should be treated as a cleanup (not a bugfix, so no need to backport).
It depends on how long you waited before the leader lost the lock. In my case, the scheduler was able to communicate with the apiserver again before it timed out waiting for the condition. The scenario I am talking about is a network condition where we lose connectivity with the apiserver for 30 seconds and the scheduler is then able to communicate again. It's not the permanent network failure that would cause the scenario you mentioned.
I think that no matter how long the scheduler waits, once the connection recovers, the scheduler will detect that it has lost the lock and exit.
@gaorong - Thanks for testing the above scenario. What type of resourceLocks are you using, configmaps or endpoints? Can you run the same test with kube 1.12+ and see if you're able to reproduce this? It's interesting that the scheduler is terminating for you. I looked further into the 1.10 code base and I think the leader election code is different. The following logs are from v1.14.0+0faddd8
You can clearly see that after What's happening? The code path is slightly different in the leader election code: the scheduler in your case is failing at the lock acquisition state while renewing, and the channel is closed right after that here - https://github.com/kubernetes/kubernetes/blob/release-1.10/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go#L162 We moved from channels to contexts in #57932 (1.12+) and I think this line
the default: endpoint
Yes, the scheduler can still terminate as before. Logs are below:
It's weird to have different behavior. I think leader election should be compatible with the previous behavior (in v1.10).
Do you have connectivity to the apiserver from the scheduler? Looking at the logs, it seems the apiserver is not running. The scenario is specific to the situation where the scheduler acquired the lock and is releasing it back.
I am using configmaps instead of endpoints.
The apiserver is listening on port 8080. I drop all TCP packets sent to the apiserver with iptables:
How can we simulate this scenario?
Do you set kubernetes/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go Lines 148 to 153 in a520302
@ravisantoshgudimetla I am not familiar with kube-scheduler in HA mode, but I have a few small questions. Could you please help? Shouldn't the previous leader be fenced when the new leader is selected? It seems that even if the previous leader exits as soon as possible, that still cannot completely avoid the race condition where multiple schedulers are scheduling pods at the same time, right? So can I consider this PR a best-effort way to avoid that race condition, one that largely reduces the chance of it?
How are the other schedulers connecting to the apiserver? Are they running on the same host? If you block inbound traffic on port 8080, they'd also lose connectivity, wouldn't they?
I am currently using a 3-node control plane cluster, where the scheduler on each node connects to the apiserver running locally. As of now, I am bouncing the apiserver one node at a time to simulate this scenario.
Good question. While I am not certain that scenario can happen in the above situation, there is a very good chance that the apiserver received the request just before the scheduler exited. The other scheduler, which gained the lock, won't send the request again because it has to wait for the informer caches to sync first (actually even before acquiring the lock). The informers talk to the apiserver and get the latest state of each pod before building the local cache and starting to schedule. The main problem in the situation I mentioned here is that the scheduler without the lock continuously sends requests to the apiserver, either for binding or to reschedule pods that have already been scheduled by the right scheduler (the one that holds the leader lock). But in general, Kubernetes is based on eventual consistency: failures can sometimes happen for various reasons, but the system should correct itself after some time.
No, there aren't. My environment is the same as yours: the scheduler on each node connects to the apiserver running locally.
@gaorong, @ravisantoshgudimetla, is there any impact if we terminate the scheduler in this case, as the other components do? IMO, terminating the scheduler seems a safer way to handle this kind of case :)
@k82cn Agreed. The main concern here is whether this is a confirmed bug fix or just a cleanup. Could we backport a cleanup? Anyway, I agree with merging this first.
@ravisantoshgudimetla Thanks for your detailed explanation. I totally agree that we should rely on "eventual consistency", and I'd add one more principle: the "eventual" duration should be as short as possible. :)
/retest Review the full test history for this PR. Silence the bot with an |
This change is fine and we can merge it, but I agree with @gaorong that we don't need to backport it if it is a cleanup.
I would consider this a bug and +1 on backporting |
What type of PR is this?
/kind bug
What this PR does / why we need it:
As of now, we're not terminating the scheduler when it loses the leader lock, so the scheduler continues watching pod and node objects and keeps scheduling pods. While this may not cause problems in a single-scheduler topology, it will certainly cause problems if HA is enabled for the control plane components. This PR ensures that the scheduler terminates instead of silently ignoring the loss of the leader lock and continuing to schedule. Without this patch, in an HA deployment the schedulers would race to schedule pods (with or without the leader lock), causing multiple bind failures because an individual scheduler will have a stale cache.
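The difference between the two callbacks can be sketched with a standard-library-only simulation. This is not client-go's leader election code: `Callbacks`, `runElector`, and `simulate` are hypothetical names, the renew loop is a stand-in, and `cancel()` is used in place of `klog.Fatalf` so the effect can be observed inside one process.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Callbacks mirrors the shape of the callbacks in the diff (hypothetical type).
type Callbacks struct {
	OnStartedLeading func(ctx context.Context)
	OnStoppedLeading func()
}

// runElector is a toy elector: it starts the leader work, keeps "renewing"
// until renew returns false, then invokes OnStoppedLeading.
func runElector(ctx context.Context, renew func() bool, cb Callbacks) {
	defer cb.OnStoppedLeading()
	go cb.OnStartedLeading(ctx)
	for renew() {
		time.Sleep(time.Millisecond)
	}
}

// simulate reports whether the scheduling loop is still running shortly
// after the lock has been lost.
func simulate(terminateOnLoss bool) (stillScheduling bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	stopped := make(chan struct{})
	schedule := func(ctx context.Context) {
		defer close(stopped)
		for {
			select {
			case <-ctx.Done():
				return
			case <-time.After(time.Millisecond):
				// Pretend to schedule one pod.
			}
		}
	}

	renewalsLeft := 3
	renew := func() bool { renewalsLeft--; return renewalsLeft >= 0 }

	// Old behavior: report the error and carry on.
	onStopped := func() { fmt.Println("lost master") }
	if terminateOnLoss {
		// New behavior: stop everything (stand-in for klog.Fatalf).
		onStopped = func() { fmt.Println("leaderelection lost"); cancel() }
	}

	runElector(ctx, renew, Callbacks{OnStartedLeading: schedule, OnStoppedLeading: onStopped})

	select {
	case <-stopped:
		return false
	case <-time.After(50 * time.Millisecond):
		return true
	}
}

func main() {
	fmt.Println("log only, still scheduling:", simulate(false))
	fmt.Println("terminate, still scheduling:", simulate(true))
}
```

With the log-only callback the scheduling loop outlives the lock; with the terminating callback it stops as soon as the lock is lost, which is the behavior this PR aims for.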
Thanks to @gnufied for identifying the issue.
/cc @bsalamat @Huang-Wei @ahg-g @damemi @ingvagabund
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: