[stable/elasticsearch] fix cluster outage during master termination #10687
Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>
/assign @desaintmartin
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: kimxogus
/assign @rendhalver
@@ -1,6 +1,8 @@
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
Again I ask why are we effectively disabling the health checks to "fix" a service rather than fixing the broken health check?
The master service is used only for discovery, and I think the success of the HTTP health check is irrelevant to it, because discovery starts before the HTTP health check succeeds.
I agree that we need to filter bad citizens.
Removed unready endpoints tolerations
/hold
@rendhalver When I dug into #8785 and elastic/elasticsearch#36822, I also tried https://github.com/elastic/helm-charts/tree/master/elasticsearch. Elastic's own helm chart had the same issue, and it doesn't seem to be fixed there. I think this change is needed by elastic's helm chart too. By the way, migrating to elastic's charts is not just a matter of switching helm repos. I don't think we can migrate to elastic's chart in the near future, so until we deprecate the stable/elasticsearch chart, I don't think we should stop enhancing it.
Ok, maybe we need to open an issue to track this and work out a fix for the problem. The elastic helm-charts run two services, which seems like a better way to solve the problem. I realise migrating to elastic/helm-charts won't be simple. We are going to write a migration plan to help people make the switch. I would like to slow down development here and focus on working out how to migrate.
I agree that we can slow down development of this chart, and I hope I can contribute to developing the elastic helm chart, but I think this is a major bugfix in a Docker environment that reduces cluster downtime from a couple of minutes to almost zero.
IMHO we should start to synchronize the changes we make on both charts. If the change is almost the same for both charts, then I think it should be accepted here if it gets accepted in elastic/helm-charts. This will allow an easier switch when we have to switch.
What this PR does / why we need it:
This PR introduces an announce service for each master pod. Each announce service points to a single master pod and has a cluster IP.
Each master node sets network.publish_host to the cluster IP of its corresponding announce service, so that discovery can still reach a live IP address for a terminated master.
With this PR, the cluster outage seems to be gone: HTTP requests to the client service respond immediately.
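For readers unfamiliar with the pattern, here is a minimal sketch of what one such announce service might look like. It is an illustration under assumptions, not the manifest from this PR: the names `es-master-0` and `es-master-0-announce` are hypothetical, it assumes the master pods are managed by a StatefulSet (so the controller-provided `statefulset.kubernetes.io/pod-name` label can select a single pod), and it assumes the default transport port 9300. How the PR actually passes the cluster IP into network.publish_host is not shown in this conversation.

```yaml
# Hypothetical sketch: one announce Service per master pod.
# All names below are illustrative and not taken from the PR.
apiVersion: v1
kind: Service
metadata:
  name: es-master-0-announce
spec:
  type: ClusterIP            # virtual IP that stays stable while the pod is replaced
  selector:
    # Label the StatefulSet controller adds to each pod it manages;
    # selecting on it makes this Service point at exactly one pod.
    statefulset.kubernetes.io/pod-name: es-master-0
  ports:
    - name: transport
      port: 9300             # assumed: default Elasticsearch transport port
      targetPort: 9300
# Assumed wiring on the Elasticsearch side (elasticsearch.yml of es-master-0):
#   network.publish_host: <cluster IP of es-master-0-announce>
```

The point of the design is that the published address belongs to the Service rather than the pod, so it remains valid while the master pod is terminated and recreated.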
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged):
Special notes for your reviewer:
Checklist
[Place an '[x]' (no spaces) in all applicable fields. Please remove unrelated fields.]