Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler should terminate on loosing leader lock #81306

Conversation

ravisantoshgudimetla
Copy link
Contributor

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind bug

What this PR does / why we need it:
As of now, we're not terminating scheduler on loosing leader lock causing the scheduler to continue watching pod, node objects and allowing the pod to be scheduled. While this may not cause problem in single scheduler topology, this will certainly cause problems if we have HA enabled for control plane components. This PR ensures that scheduler terminates instead of silently ignoring the loss of leader lock and proceeding with scheduling. Without this patch, we'd notice that in HA deployment, schedulers would race to schedule pods(with or without leader lock) causing multiple bind failures as the individual scheduler will have a stale cache.

Thanks to @gnufied for identifying the issue.

/cc @bsalamat @Huang-Wei @ahg-g @damemi @ingvagabund
Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Scheduler should terminate when it looses leader lock.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 12, 2019
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 12, 2019
@ravisantoshgudimetla
Copy link
Contributor Author

/sig scheduling
/priority important

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 12, 2019
@ravisantoshgudimetla ravisantoshgudimetla changed the title Terminate on leader election Scheduler should terminate on loosing leader lock Aug 12, 2019
@ravisantoshgudimetla
Copy link
Contributor Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 12, 2019
@@ -262,7 +261,7 @@ func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, regis
cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
utilruntime.HandleError(fmt.Errorf("lost master"))
klog.Fatalf("leaderelection lost")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Now we're consistent with our neighbours:

cmd/kube-controller-manager/app/controllermanager.go
281:			OnStoppedLeading: func() {
282-				klog.Fatalf("leaderelection lost")

cmd/kube-scheduler/app/server.go
264:			OnStoppedLeading: func() {
265-				utilruntime.HandleError(fmt.Errorf("lost master"))

cmd/cloud-controller-manager/app/controllermanager.go
209:			OnStoppedLeading: func() {
210-				klog.Fatalf("leaderelection lost")

Copy link
Member

@bsalamat bsalamat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Thanks, @ravisantoshgudimetla for fixing this important bug. We must backport this to older versions (1.13+) of K8s.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 12, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei
Copy link
Member

/retest

@gaorong
Copy link
Contributor

gaorong commented Aug 13, 2019

/hold
(if I have permission)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2019
@gaorong
Copy link
Contributor

gaorong commented Aug 13, 2019

Sorry to bother you, could you please provide some real cases and how to reproduce this issue?
I have tested this and scheduler can successfully exit.
step as below:

  1. start two schedulers, say scheduler1 and scheduler2.
  2. the scheduler1 comes as the leader.
  3. break the connection between scheduler1 and apiserver using iptables: iptables -I INPUT -p tcp --dport 8083 -j DROP.
  4. scheduler1 can not schedule pod because of broken connection and scheduler2 become as the leader successfully.
  5. remove iptables from step 3 and connection recover. then scheduler1 finds himself has lost lease and EXITs successfully.
  6. node agent supervisor(e.g. systemd) restart scheduler1.

logs as below: (ignore the unsupported 1.10 version, this snippet hasn't changed since then)

I0813 09:57:56.190610    3114 server.go:562] Version: v1.10.11
I0813 09:57:56.191312    3114 server.go:581] starting healthz server on 0.0.0.0:10251
I0813 09:57:57.091874    3114 controller_utils.go:1027] Waiting for caches to sync for scheduler controller
I0813 09:57:57.191994    3114 controller_utils.go:1034] Caches are synced for scheduler controller
I0813 09:57:57.192019    3114 leaderelection.go:175] attempting to acquire leader lease  kube-system/kube-scheduler...
I0813 09:58:15.926175    3114 leaderelection.go:184] successfully acquired lease kube-system/kube-scheduler
I0813 09:59:10.889224    3114 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0813 09:59:10.889265    3114 server.go:619] lost master
error: lost lease

It seems all work fine.
IIUC, If scheduler loses lease, it will return from leaderElector.Run(ctx) and then invoke return fmt.Errorf("lost lease") to exit.

if cc.LeaderElection != nil {
cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
utilruntime.HandleError(fmt.Errorf("lost master"))
},
}
leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
if err != nil {
return fmt.Errorf("couldn't create leader elector: %v", err)
}
leaderElector.Run(ctx)
return fmt.Errorf("lost lease")

if we want to be consistent with our neighbors, we can merge this but it targets as cleanup (not bugfix, so no need to backport).

@ravisantoshgudimetla
Copy link
Contributor Author

ravisantoshgudimetla commented Aug 13, 2019

I0813 09:59:10.889224 3114 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition

It depends on how long you waited before the leader looses lock, in my case the scheduler was able to communicate with apiserver before it timed out waiting for condition, the scenario I am talking about is a network condition where we would loose connectivity with apiserver for 30 seconds and then scheduler is able to communicate again, it's not a permanent network failure which would cause the above scenario you mentioned.

@gaorong
Copy link
Contributor

gaorong commented Aug 13, 2019

It depends on how long you waited before the leader looses lock

I think no matters how long the scheduler wait, once the connection recover, the scheduler will detect himself has lost lock and exit.

@ravisantoshgudimetla
Copy link
Contributor Author

ravisantoshgudimetla commented Aug 13, 2019

@gaorong - Thanks for testing the above scenario. What type of resourceLocks are you using? configmaps or endpoints? Can you simulate the same test with kube 1.12+ and see if you're able to reproduce this?

It's interesting that the scheduler is terminating for you, I looked more in 1.10 code base and I think leader election code is different.

Following are the logs from v1.14.0+0faddd8

I0810 18:54:37.287283       1 leaderelection.go:263] failed to renew lease openshift-kube-scheduler/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0810 18:54:37.287313       1 server.go:248] lost master
E0810 18:54:38.278507       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0810 18:54:38.278654       1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:234: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E0810 18:54:38.278787       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
I0810 18:55:22.386588       1 scheduler.go:417] Failed to bind pod: test-insights/insights-operator-9868bfd69-kklx8
E0810 18:55:22.386635       1 factory.go:660] Error scheduling test-insights/insights-operator-9868bfd69-kklx8: Operation cannot be fulfilled on pods/binding "insights-operator-9868bfd69-kklx8": pod insights-operator-9868bfd69-kklx8 is already assigned to node "test-hlb8q-master-0"; retrying

You can clearly see that after lost master, we're not exiting from scheduler and continuing with scheduling.

What's happening?

The code path is slightly different in leader election code, scheduler in your case is failing at lock acquistion state while renewing and the channel is closing right after that here - https://github.com/kubernetes/kubernetes/blob/release-1.10/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go#L162

We moved to contexts from channels in #57932 1.12+ and I think this line

, won't always cancel the parent context causing explicit exit in the component that calls the leader election code.

@gaorong
Copy link
Contributor

gaorong commented Aug 14, 2019

What type of resourceLocks are you using? configmaps or endpoints?

the default: endpoint

Can you simulate the same test with kube 1.12+ and see if you're able to reproduce this?

Yes, scheduler can still terminate as before.
I run v1.15.0 scheduler as a foreground process with minimal configuration.

logs as below:

[root@k8s-test-master01 ]#  /usr/bin/kube-scheduler  --v=0 --master=http://127.0.0.1:8080
I0814 10:17:08.141925   11894 serving.go:319] Generated self-signed cert in-memory
W0814 10:17:08.313407   11894 authentication.go:249] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0814 10:17:08.313424   11894 authentication.go:252] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0814 10:17:08.313439   11894 authorization.go:146] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0814 10:17:08.314082   11894 server.go:142] Version: v1.15.0
I0814 10:17:08.314117   11894 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W0814 10:17:08.315223   11894 authorization.go:47] Authorization is disabled
W0814 10:17:08.315234   11894 authentication.go:55] Authentication is disabled
I0814 10:17:08.315244   11894 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0814 10:17:08.315639   11894 secure_serving.go:116] Serving securely on [::]:10259
I0814 10:17:09.218792   11894 leaderelection.go:235] attempting to acquire leader lease  kube-system/kube-scheduler...
I0814 10:17:24.964249   11894 leaderelection.go:245] successfully acquired lease kube-system/kube-scheduler
E0814 10:19:09.392164   11894 leaderelection.go:324] error retrieving resource lock kube-system/kube-scheduler: Get http://127.0.0.1:8080/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0814 10:19:09.392245   11894 event.go:296] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'k8s-test-master01_bc8bdc92-a9d8-423d-9177-e48bc60b3f80 stopped leading'
I0814 10:19:09.392515   11894 leaderelection.go:281] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0814 10:19:09.392538   11894 server.go:254] lost master
lost lease
[root@k8s-test-master01 ]# 

it's weird to have different behavior. I think leader-election should be compatible with previous behavior (in v1.10).
Are there some special configurations for you?

@ravisantoshgudimetla
Copy link
Contributor Author

I run v1.15.0 scheduler as a foreground process with minimal configuration.

Do you have connectivity to apiserver from scheduler, looking at the logs, it seems apiserver is not running. The scenario is specific to situation where the scheduler acquired lock and releasing it back.

Are there some special configurations for you?

I am using configmaps instead of endpoints/

@gaorong
Copy link
Contributor

gaorong commented Aug 14, 2019

Do you have connectivity to apiserver from scheduler, looking at the logs, it seems apiserver is not running

apiserver is listening on port 8080, I drop all tcp packages sending to apiserver by iptables: iptables -I INPUT -p tcp --dport 8080 -j DROP. maybe this is equivalent to apiserver is down. as shown in the scheduler flag in above example, the scheduler is connecting to apiserver by localhost, so when breaking one of apiservers' connectivity, other schedulers can still successfully connect to apiserver within the same host and acquire the lock.

The scenario is specific to situation where the scheduler acquired lock and releasing it back.

How can we simulate this scenario?

@gaorong
Copy link
Contributor

gaorong commented Aug 14, 2019

The scenario is specific to situation where the scheduler acquired lock and releasing it back.

Do you set ReleaseOnCancel to true by changing scheduler code? the scheduler would not release lock proactively.

// ReleaseOnCancel should be set true if the lock should be released
// when the run context is cancelled. If you set this to true, you must
// ensure all code guarded by this lease has successfully completed
// prior to cancelling the context, or you may have two processes
// simultaneously acting on the critical path.
ReleaseOnCancel bool

@yqwang-ms
Copy link
Contributor

yqwang-ms commented Aug 14, 2019

@ravisantoshgudimetla I am not familiar with kube-scheduler in HA mode, but I have a few small questions, could you please help:

Shouldn't the previous leader be fenced when the new leader is selected?

Seems the previous leader just exit as soon as possible, still cannot totally avoid your race condition that multiple scheduler are scheduling pods at the same time, right?
Such as, the binding request from the previous leader is being sending to ApiServer, exit the process may not be able to cancel the sending (at least when the request is already sent to network).

So, can I consider this PR is still best effort to avoid the race condition above, but it largerly reduces the chance, right?
BTW, seems kube-scheduler do not need to totally avoid it (by fencing etc)?

@ravisantoshgudimetla
Copy link
Contributor Author

so when breaking one of apiservers' connectivity, other schedulers can still successfully connect to apiserver within the same host and acquire the lock.

How're the other schedulers connecting to apiserver? Are they running on the same host? If you block inbound traffic on port 8080, they'd also loose connectivity, isn't it?

How can we simulate this scenario?

I am currently using a 3 node control plane cluster, where each scheduler on every node connects to apiserver running locally. As of now, I am bouncing apiserver one node at a time to simulate this scenario

So, can I consider this PR is still best effort to avoid the race condition above, but it largerly reduces the chance, right?
BTW, seems kube-scheduler do not need to totally avoid it (by fencing etc)?

Good question, while I am not certain that scenario can happen in the above situation, there is a very good chance that apiserver received the request just before scheduler exited, the other scheduler which gained the lock won't send the request again because it'll have to wait for the informer caches to sync initially(actually even before acquiring the lock). The informers would talk to apiserver and get the latest state of pod before building the local cache and start scheduling. The main problem in the situation that I mentioned here is scheduler without lock is continuously sending requests to apiserver for binding or trying to reschedule pods that are already scheduled by the right scheduler (the one which has leader lock).

But in general, kubernetes is based on eventual consistency, where failure can happen sometimes because of various reasons but system should correct itself after some time.

@gaorong
Copy link
Contributor

gaorong commented Aug 15, 2019

Are they running on the same host?

No, there aren't. my environment is same as you: each scheduler on every node connects to apiserver running locally.

@k82cn
Copy link
Member

k82cn commented Aug 15, 2019

@gaorong , @ravisantoshgudimetla , is there any impact if we terminate scheduler in this case, similar to other component? IMO, terminate the scheduler seems a safer way to handle such kind of case :)

@gaorong
Copy link
Contributor

gaorong commented Aug 15, 2019

@k82cn Agree, The main concern here is if this is a confirmed bug-fix or just a cleanup. could we backport a cleanup?

anyway, agree with merging this first
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2019
@yqwang-ms
Copy link
Contributor

@ravisantoshgudimetla Thanks for your detailed explaination. Totally agree we should be "eventual consistency", and add one more principle, the "eventual" duration should be as shorter as possible. :)

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

2 similar comments
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@bsalamat
Copy link
Member

This change is fine and we can merge it, but I agree with @gaorong that we don't need to back port it if it is a clean-up.

@alculquicondor
Copy link
Member

I would consider this a bug and +1 on backporting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants