Skip to content
This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/elasticsearch] fix cluster outage during master termination #10687

Closed
wants to merge 6 commits into from

Conversation

kimxogus
Copy link
Contributor

What this PR does / why we need it:

This PR introduces announce service which points to each master pod and has cluster ip.
Master node will have network.publish_host as corresponding announce service's cluster ip, so that discovery service can access live ip address of terminated master.
With this PR, it seems cluster outage is gone. Http requests to client service responses immediately.

Which issue this PR fixes

(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged)

Special notes for your reviewer:

Checklist

[Place an '[x]' (no spaces) in all applicable fields. Please remove unrelated fields.]

  • DCO signed
  • Chart Version bumped
  • Variables are documented in the README.md

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>
Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>
Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>
@helm-bot helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2019
@kimxogus
Copy link
Contributor Author

/assign @desaintmartin

@helm-bot helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kimxogus
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: desaintmartin

If they are not already assigned, you can assign the PR to them by writing /assign @desaintmartin in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kimxogus kimxogus changed the title fix #8785 [stable/elasticsearch] fix cluster outage during master termination #8785 Jan 16, 2019
Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>
@helm-bot helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2019
@kimxogus
Copy link
Contributor Author

/assign @rendhalver

@kimxogus kimxogus changed the title [stable/elasticsearch] fix cluster outage during master termination #8785 [stable/elasticsearch] fix cluster outage during master termination Jan 16, 2019
@@ -1,6 +1,8 @@
apiVersion: v1
kind: Service
metadata:
annotations:
service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I ask why are we effectively disabling the health checks to "fix" a service rather than fixing the broken health check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

master service is only for discovery service and I think it's irrelevant to success of http health check because discovery service starts before http health check succeeded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we need to filter bad citizens.
Removed unready endpoints tolerations

@rendhalver
Copy link
Collaborator

/hold
This change will be unnecessary when we migrate to the elastic helm charts discussed here #10543

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2019
@kimxogus
Copy link
Contributor Author

kimxogus commented Jan 17, 2019

@rendhalver When I dived into #8785 and elastic/elasticsearch#36822 , I also tried https://github.com/elastic/helm-charts/tree/master/elasticsearch . and elastic's own helm chart had the same issue and it doesn't seem to be fixed in elastic's chart. I think this change is needed by elastic's helm chart too.

By the way, migrating elastic's charts is not just a moving helm repo, values.yaml should be rewritten, so there are many considerations needed to migrate elastic's chart like migrating with save pvs(it seems to be next priority) and there are some features that exists here should be added to elastic's chart such as lifecycle hook.

I don't think we can migrate to elastic's chart in near future, so before when we deprecate stable/elasticsearch chart, I don't think we should stop enhancing this chart.

@rendhalver
Copy link
Collaborator

Ok maybe we need to open an issue to track this and work out a fix for the problem.
I think effectively ignoring the health check isn't the right way to fix the problem.
Kubernetes relies on health checks and if we start ignoring them we are bad citizens.

The elastic helm-charts run two services which seems like a better way to solve the problem.
The run one service for cluster discovery and one service for writing to the cluster.

I realise migrating to the elastic/helm-charts won't be simple. We are going to write a migration plan to assist people to make the switch.

I would like to slow down development here and focus on working out how to migrate.

Signed-off-by: Taehyun Kim <kgyoo8232@gmail.com>
@helm-bot helm-bot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 18, 2019
@helm-bot helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 18, 2019
@kimxogus
Copy link
Contributor Author

I agree that we can slow down development of this chart, and I hope I can contribute developing elastic helm chart, but I think this is a kind of major bugfix in docker environment which reduces cluster downtime from a couple of minutes to almost zero.

@desaintmartin
Copy link
Collaborator

IMHO we should start to synchronize changes we make on both charts. If the change is almost the same for both Charts, then I think it should be accepted here if it gets accepted in elastic/helm-charts.

This will allow easier switch when we have to switch.

@helm-bot helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2019
@kimxogus kimxogus closed this Feb 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds
5 participants