Add ingester.autoforget-unhealthy-timeout opt-in feature #3919

github-vincent-miszczak · 2021-06-30T15:12:54Z

Follow-up to #3661. I took too long to push an update and the PR was automatically closed :(

What this PR does / why we need it:
It adds a new opt-in parameter that automatically forgets unhealthy ingesters once their heartbeat timeout has elapsed.

Which issue(s) this PR fixes:
Partially fixes #3360

Special notes for your reviewer:
Not everyone is using StatefulSets. In my case, I'm running on AWS ECS in a stateless configuration. If one of my ingesters gets killed for whatever reason, it is not a problem because I have RF=3: a new task is spawned.

At the moment, the failed ingester remains in the ring forever until someone presses the Forget button on the ingester /ring page. Until then, another task failing makes the cluster unavailable even if 2 out of 3 ingesters are healthy, because Loki does not use a sloppy quorum.
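The arithmetic behind this can be sketched as follows. It assumes a strict majority quorum over the replication set, where an unhealthy ring entry counts as a failed replica instead of being swapped for another healthy instance. The `canWrite` helper is hypothetical and is not Loki's actual ring code.

```go
package main

import "fmt"

// canWrite reports whether a write can still reach a strict majority quorum
// when `unhealthy` of the `rf` replicas selected for a stream are unhealthy.
func canWrite(rf, unhealthy int) bool {
	minSuccess := rf/2 + 1 // majority: 2 of 3 for RF=3
	return rf-unhealthy >= minSuccess
}

func main() {
	fmt.Println(canWrite(3, 1)) // true: one stale entry is tolerated
	fmt.Println(canWrite(3, 2)) // false: a second failure breaks quorum
}
```

With RF=3, a stale entry left in the ring permanently uses up the single tolerated failure, which is why forgetting it promptly matters.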

Checklist

Documentation added
Tests updated

bt909 · 2021-07-02T08:54:40Z

I have tested it, because I was also looking for a feature like this. It works well. Thank you.

owen-d

LGTM

pull-request-size bot added the size/L label Jun 30, 2021

Add ingester.autoforget-unhealthy-timeout opt-in feature

00ca779

github-vincent-miszczak force-pushed the unhealthy-ingester-removal branch from 34e05cf to 00ca779 July 1, 2021 16:44

github-vincent-miszczak marked this pull request as ready for review July 1, 2021 16:56

github-vincent-miszczak requested a review from KMiller-Grafana as a code owner July 1, 2021 16:56

KMiller-Grafana reviewed Jul 1, 2021

docs/sources/configuration/_index.md (outdated, resolved)

Update docs/sources/configuration/_index.md

55be01e

Co-authored-by: Karen Miller <84039272+KMiller-Grafana@users.noreply.github.com>

dbluxo mentioned this pull request Jul 9, 2021

"too many failed ingesters" using memberlist #3360

Closed

owen-d approved these changes Jul 12, 2021

Merge branch 'main' into unhealthy-ingester-removal

74156e6

owen-d merged commit d6dc8b9 into grafana:main Jul 12, 2021

ctorrisi mentioned this pull request Oct 20, 2021

the ring never removes old ingester even if the ingester pod is evicted cortexproject/cortex#1521

Open

jmcarp mentioned this pull request Feb 4, 2022

distributor: optionally auto-forget unhealthy instances cortexproject/cortex#4641

Closed

3 tasks

jmcarp added a commit to jmcarp/cortex that referenced this pull request Feb 4, 2022

distributor: optionally auto-forget unhealthy instances

7f85125

Implementation adapted from grafana/loki#3919. Related to cortexproject#1521. Signed-off-by: Josh Carp <jm.carp@gmail.com>

chenfeilee mentioned this pull request Feb 7, 2022

autoforget_unhealthy for Tempo ingesters grafana/tempo#1275

Open

RECturtle mentioned this pull request Aug 31, 2022

Setting to automatically forget unhealthy ring members grafana/mimir#2869

Open
