[stable/elasticsearch] fix cluster outage during master termination #10687

kimxogus · 2019-01-16T03:02:27Z

What this PR does / why we need it:

This PR introduces announce service which points to each master pod and has cluster ip.
Master node will have network.publish_host as corresponding announce service's cluster ip, so that discovery service can access live ip address of terminated master.
With this PR, it seems cluster outage is gone. Http requests to client service responses immediately.

Which issue this PR fixes

(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged)

fixes [stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds #8785

Special notes for your reviewer:

Checklist

[Place an '[x]' (no spaces) in all applicable fields. Please remove unrelated fields.]

DCO signed
Chart Version bumped
Variables are documented in the README.md

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

kimxogus · 2019-01-16T03:03:48Z

/assign @desaintmartin

k8s-ci-robot · 2019-01-16T03:06:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kimxogus
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: desaintmartin

If they are not already assigned, you can assign the PR to them by writing /assign @desaintmartin in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

stable/elasticsearch/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

kimxogus · 2019-01-16T03:44:35Z

/assign @rendhalver

rendhalver · 2019-01-16T13:43:45Z

stable/elasticsearch/templates/master-svc.yaml

@@ -1,6 +1,8 @@
 apiVersion: v1
 kind: Service
 metadata:
+  annotations:
+    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"


Again I ask why are we effectively disabling the health checks to "fix" a service rather than fixing the broken health check?

master service is only for discovery service and I think it's irrelevant to success of http health check because discovery service starts before http health check succeeded.

I agree that we need to filter bad citizens.
Removed unready endpoints tolerations

rendhalver · 2019-01-16T13:46:46Z

/hold
This change will be unnecessary when we migrate to the elastic helm charts discussed here #10543

kimxogus · 2019-01-17T02:23:26Z

@rendhalver When I dived into #8785 and elastic/elasticsearch#36822 , I also tried https://github.com/elastic/helm-charts/tree/master/elasticsearch . and elastic's own helm chart had the same issue and it doesn't seem to be fixed in elastic's chart. I think this change is needed by elastic's helm chart too.

By the way, migrating elastic's charts is not just a moving helm repo, values.yaml should be rewritten, so there are many considerations needed to migrate elastic's chart like migrating with save pvs(it seems to be next priority) and there are some features that exists here should be added to elastic's chart such as lifecycle hook.

I don't think we can migrate to elastic's chart in near future, so before when we deprecate stable/elasticsearch chart, I don't think we should stop enhancing this chart.

rendhalver · 2019-01-17T18:39:42Z

Ok maybe we need to open an issue to track this and work out a fix for the problem.
I think effectively ignoring the health check isn't the right way to fix the problem.
Kubernetes relies on health checks and if we start ignoring them we are bad citizens.

The elastic helm-charts run two services which seems like a better way to solve the problem.
The run one service for cluster discovery and one service for writing to the cluster.

I realise migrating to the elastic/helm-charts won't be simple. We are going to write a migration plan to assist people to make the switch.

I would like to slow down development here and focus on working out how to migrate.

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

kimxogus · 2019-01-18T00:42:38Z

I agree that we can slow down development of this chart, and I hope I can contribute developing elastic helm chart, but I think this is a kind of major bugfix in docker environment which reduces cluster downtime from a couple of minutes to almost zero.

desaintmartin · 2019-01-21T09:01:46Z

IMHO we should start to synchronize changes we make on both charts. If the change is almost the same for both Charts, then I think it should be accepted here if it gets accepted in elastic/helm-charts.

This will allow easier switch when we have to switch.

kimxogus added 3 commits January 16, 2019 10:43

discovery headless service

2ae87e7

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

port-forward to client service

44c91a4

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

fix helm#8785

205447b

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2019

k8s-ci-robot requested review from icereval and rendhalver January 16, 2019 03:02

kimxogus mentioned this pull request Jan 16, 2019

[stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds #8785

Closed

k8s-ci-robot assigned desaintmartin Jan 16, 2019

helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2019

kimxogus changed the title ~~fix #8785~~ [stable/elasticsearch] fix cluster outage during master termination #8785 Jan 16, 2019

remove unneccsary annotations

416e3d9

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

kimxogus force-pushed the fix/8785 branch from c8688c7 to 416e3d9 Compare January 16, 2019 03:29

helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2019

k8s-ci-robot assigned rendhalver Jan 16, 2019

kimxogus changed the title ~~[stable/elasticsearch] fix cluster outage during master termination #8785~~ [stable/elasticsearch] fix cluster outage during master termination Jan 16, 2019

rendhalver reviewed Jan 16, 2019

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2019

remove unready endpoints toleration

80671b0

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>

helm-bot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 18, 2019

helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 18, 2019

kimxogus mentioned this pull request Jan 21, 2019

fix cluster outage, add masterService template elastic/helm-charts#41

Closed

Merge branch 'master' into fix/8785

6445efb

helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2019

kimxogus closed this Feb 14, 2019

andreykaipov mentioned this pull request Feb 18, 2019

Slow re-election when elected master pod is deleted elastic/helm-charts#63

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[stable/elasticsearch] fix cluster outage during master termination #10687

[stable/elasticsearch] fix cluster outage during master termination #10687

kimxogus commented Jan 16, 2019

kimxogus commented Jan 16, 2019

k8s-ci-robot commented Jan 16, 2019

kimxogus commented Jan 16, 2019

rendhalver Jan 16, 2019

kimxogus Jan 17, 2019

kimxogus Jan 18, 2019

rendhalver commented Jan 16, 2019

kimxogus commented Jan 17, 2019 •

edited

Loading

rendhalver commented Jan 17, 2019

kimxogus commented Jan 18, 2019

desaintmartin commented Jan 21, 2019

[stable/elasticsearch] fix cluster outage during master termination #10687

[stable/elasticsearch] fix cluster outage during master termination #10687

Conversation

kimxogus commented Jan 16, 2019

What this PR does / why we need it:

Which issue this PR fixes

Special notes for your reviewer:

Checklist

kimxogus commented Jan 16, 2019

k8s-ci-robot commented Jan 16, 2019

kimxogus commented Jan 16, 2019

rendhalver Jan 16, 2019

Choose a reason for hiding this comment

kimxogus Jan 17, 2019

Choose a reason for hiding this comment

kimxogus Jan 18, 2019

Choose a reason for hiding this comment

rendhalver commented Jan 16, 2019

kimxogus commented Jan 17, 2019 • edited Loading

rendhalver commented Jan 17, 2019

kimxogus commented Jan 18, 2019

desaintmartin commented Jan 21, 2019

kimxogus commented Jan 17, 2019 •

edited

Loading