the ring never removes old ingester even if the ingester pod is evicted #1521
If the ingester shut down cleanly, even on eviction, then it would not be in the ring. So, the first task is to find out why it did not shut down cleanly, and if possible fix that. Everything else you report is deliberate. We return not-ready to halt a rolling update.
Actually I don’t understand what "shut down cleanly" means here.
I mean the ingester went through its exit sequence, rather than being abruptly terminated from outside. There are two main cases: hand-over to another ingester, and flush to store. In both cases the time required is a function of how much data is in memory. When using an explicitly provisioned store (eg DynamoDB) it would be nice to scale up specifically for a “save everything” operation. There’s no code to do that currently.
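For readers new to the codebase, here is a minimal Go sketch of the two clean-exit paths described above. This is not the actual Cortex shutdown code; the function names (`shutdown`, `findHandoverTarget`, `transferTo`, `flushAll`) are invented for illustration, and the real logic lives in the ingester lifecycler.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// shutdown sketches the two clean-exit paths: first try to hand over
// in-memory series to a joining ingester; if no target is found, flush
// everything to the chunk store. Both paths take time proportional to the
// amount of data held in memory, which is why the grace period matters.
func shutdown(findHandoverTarget func() (string, error), transferTo func(string) error, flushAll func() error) error {
	if target, err := findHandoverTarget(); err == nil {
		fmt.Println("handing over in-memory data to", target)
		return transferTo(target)
	}
	fmt.Println("no hand-over target found, flushing all data to the store")
	return flushAll()
}

func main() {
	// Simulated environment: no pending ingester to hand over to, so we flush.
	err := shutdown(
		func() (string, error) { return "", errors.New("no pending ingester") },
		func(addr string) error { return nil },
		func() error { time.Sleep(10 * time.Millisecond); return nil }, // stands in for a long flush
	)
	fmt.Println("clean exit:", err == nil)
}
```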
I tried to reproduce the problem by deleting the pod with --force, and a new ingester pod was created by the deployment controller immediately. I'm confused because of my -distributor.replication-factor setting, so is there anything I misunderstood? I wonder when the ring adds an ingester and when it removes one. Does Consul do it by itself, or does the ingester tell it what to do? I noticed that when an ingester starts and shuts down, it tells the ring. But what if the ingester shuts down uncleanly - is there any solution to automatically clean the unhealthy pod out of the ring? By the way, after I restarted my Consul, the ring only had the active ingester and everything worked well.
I know...
so the replicationFactor is 2 now, instead of what I set in -distributor.replication-factor.
That sounds like the same problem as #1290; see cortex/pkg/ring/replication_strategy.go, line 20 at 7cf0690.
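To make the "RF is suddenly 2" observation concrete, here is a hedged Go sketch of the quorum logic around that line. The types and names (`Instance`, `filterForWrite`) are illustrative, not the real Cortex API: the idea is that when a stale entry is still part of the replica set, the effective replication factor is raised to the set size, so a single healthy ingester can no longer satisfy the write quorum.

```go
package main

import (
	"fmt"
	"time"
)

// Instance is a simplified stand-in for a ring entry; the real entry also
// carries tokens, state, zone, etc.
type Instance struct {
	Addr          string
	LastHeartbeat time.Time
}

func (i Instance) healthy(timeout time.Duration, now time.Time) bool {
	return now.Sub(i.LastHeartbeat) <= timeout
}

// filterForWrite sketches the quorum logic: if the replica set picked from
// the ring is larger than the configured replication factor (for example
// because a dead ingester still owns tokens), the effective replication
// factor is raised to the set size, and a write needs a quorum of that
// effective factor.
func filterForWrite(replicaSet []Instance, configuredRF int, heartbeatTimeout time.Duration, now time.Time) ([]Instance, error) {
	rf := configuredRF
	if len(replicaSet) > rf {
		rf = len(replicaSet) // this is why RF "becomes 2" with one live and one stale entry
	}

	var healthy []Instance
	for _, ing := range replicaSet {
		if ing.healthy(heartbeatTimeout, now) {
			healthy = append(healthy, ing)
		}
	}

	minSuccess := (rf / 2) + 1
	if len(healthy) < minSuccess {
		return nil, fmt.Errorf("at least %d live replicas required, could only find %d", minSuccess, len(healthy))
	}
	return healthy, nil
}

func main() {
	now := time.Now()
	set := []Instance{
		{Addr: "ingester-new", LastHeartbeat: now},
		{Addr: "ingester-stale", LastHeartbeat: now.Add(-10 * time.Minute)},
	}
	_, err := filterForWrite(set, 1, time.Minute, now)
	fmt.Println(err) // quorum of 2 needed, only 1 healthy, so writes fail
}
```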
Actually, if I deploy one ingester and the replicationFactor is 1, and the ingester pod is evicted because of low memory, the kubelet starts another ingester pod. However, the previous ingester didn't exit cleanly, so the corresponding entry in the Consul ring will never be cleaned up. So at this moment:
However, there are two problems here:
@bboreham @tomwilkie @csmarchbanks Any ideas?
The current design requires that you set this value. Your point 1 seems the same as #1290. Point 2 isn't handled because we don't have enough experience of situations that need this. We would probably add it as an option if someone were to submit a PR.
@bboreham If the ingester is killed because of OOM (actually the ingester consumes a lot of memory and this is very common in k8s, at least in my k8s environment), then it will never have a chance to exit cleanly. For point 1, I think the replicationFactor is configured by the user, so keeping it constant may be more reasonable. For point 2, I may need to read more code to better understand the design intent. If it's necessary, I'd like to make a PR to fix it.
"killed because of OOM" is not the same thing as "evicted". A pod that is OOM-killed will restart with the same identity on the same node, hence pick up the same entry in the Cortex ring. |
@bboreham You are right. In our environment the kubelet was configured with hard eviction, so the ingester pod was evicted without a grace period. However, even when configuring the kubelet with soft eviction, I have no idea how to pick a suitable grace period. And if the ingester still can't exit cleanly within the configured grace period, the problem is still not solved.
Hi. Having a look at the code, my assumption is the following (please correct me if I'm wrong):
Some questions:
After changing the aforementioned code (lines 278-280) to the following, I stopped receiving that error.
Also, when I set the relevant option, note the loop in cortex/pkg/ingester/client/pool.go, lines 75 to 92 at 1ca4ad0, which removes unhealthy clients from the pool.
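For context, here is a rough Go sketch of what that client-pool cleanup does. The names (`pool`, `cleanUnhealthy`, `healthChecker`) are illustrative rather than the exact Cortex code; the point, confirmed in the next two comments, is that it only prunes the distributor's in-memory map of gRPC clients and never touches the ring stored in Consul.

```go
package main

import (
	"fmt"
	"sync"
)

// healthChecker stands in for an ingester gRPC client; only the health check
// matters for this sketch.
type healthChecker interface {
	HealthCheck() error
}

// pool is a simplified picture of the distributor-side client pool: a map of
// ingester address to client, periodically pruned of clients that fail their
// health check. Pruning only touches this in-memory map; the entry in the
// ring (Consul/etcd) is left untouched.
type pool struct {
	mtx     sync.Mutex
	clients map[string]healthChecker
}

func (p *pool) cleanUnhealthy() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for addr, c := range p.clients {
		if err := c.HealthCheck(); err != nil {
			fmt.Printf("removing ingester client for %s: %v\n", addr, err)
			delete(p.clients, addr) // deleting while ranging is safe in Go
		}
	}
}

// fakeClient lets the sketch run without a real gRPC connection.
type fakeClient struct{ err error }

func (f fakeClient) HealthCheck() error { return f.err }

func main() {
	p := &pool{clients: map[string]healthChecker{
		"ingester-new":   fakeClient{},
		"ingester-stale": fakeClient{err: fmt.Errorf("connection refused")},
	}}
	p.cleanUnhealthy()
	fmt.Println("clients left in pool:", len(p.clients)) // 1 - but the ring still has 2 entries
}
```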
I tried this, but it just removes the unhealthy ingester from the distributor pool (which holds ingester clients) instead of removing it from the Consul ring. It doesn't work for me.
Don't read too much into the words - that's removing them from one data structure in memory. There is no code to remove ingesters from the ring when they are suspected to be dead, and this was deliberate.
@bboreham Then what is the purpose of that code?
Risky, easy to get wrong, not necessary day one.
that was to fix #217 |
Here's an example scenario we want to avoid: Cortex is running under Kubernetes, and a rolling update begins:
Now, if we allow the rolling update to proceed, the same thing will happen in each case and we will lose the unflushed data from all ingesters, which could be a significant proportion of all data in the last 12 hours. With the current code the rolling update is halted because there will be an "unhealthy" entry for the old ingester in the ring, and this means the new ingester will never show "ready" to Kubernetes.
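A hedged Go sketch of the readiness behaviour described above, with invented names (`ringEntry`, `readinessHandler`); this is not the actual Cortex readiness check, just the shape of it: the handler returns 503 while any ring entry's heartbeat is stale, which is exactly what keeps the new pod not-ready and stalls the rolling update.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// ringEntry is a minimal stand-in for an ingester's entry in the ring.
type ringEntry struct {
	id            string
	lastHeartbeat time.Time
}

// allHealthy reports whether every ring entry has heartbeated recently.
func allHealthy(entries []ringEntry, timeout time.Duration, now time.Time) bool {
	for _, e := range entries {
		if now.Sub(e.lastHeartbeat) > timeout {
			return false
		}
	}
	return true
}

// readinessHandler returns 503 while any ring entry looks unhealthy, which is
// what keeps the new pod not-ready and halts the rolling update.
func readinessHandler(getRing func() []ringEntry, heartbeatTimeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !allHealthy(getRing(), heartbeatTimeout, time.Now()) {
			http.Error(w, "not ready: unhealthy instances in the ring", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ready")
	}
}

func main() {
	ring := []ringEntry{
		{id: "ingester-new", lastHeartbeat: time.Now()},
		{id: "ingester-old", lastHeartbeat: time.Now().Add(-10 * time.Minute)}, // stale entry left by an unclean shutdown
	}
	http.Handle("/ready", readinessHandler(func() []ringEntry { return ring }, time.Minute))
	// GET /ready keeps returning 503 until the stale entry heartbeats again or is forgotten.
	_ = http.ListenAndServe(":8080", nil)
}
```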
@bboreham Yes, it's exactly the scenario we encountered and it's annoying.
I think you would find losing half the data more annoying than having to operate the system manually when there is a fault.
I also hit this issue. I was able to work around it by completely wiping the slate clean, but it's not ideal.
If you indeed hit the same issue, please follow the steps in #1521 (comment). If your issue is different, please file it separately.
Hi -- FYI I've found another way to hit this. Suppose you have 3 StatefulSet replicas with the "Ordered" pod management policy:
I experienced this running with preemptible nodes (I know, I know) and confirmed with manual testing. If the "Parallel" policy is used instead then pod-1 & pod-2 start in parallel and pick up their former places in the ring.
Why is pod-0 marked as unhealthy? I can't understand this.
Now that chunks storage is deprecated and we use blocks storage, we no longer "hand-over" from one ingester to another. Happy to hear experience reports from people who did automate it.
Loki has an ingester.autoforget_unhealthy option. Would it be possible to add the same functionality into Cortex? Or is there another way to facilitate the same behaviour as Loki's autoforget_unhealthy?
I've read through this issue and the linked issues, and it's still unclear to me whether there is a way to have the ingester ring self-heal in case of unclean shutdowns. Not needing human operator intervention would be extremely valuable to us, as we are losing much more data due to ingesters being down compared to what we would lose by auto-forgetting unhealthy ingesters from the ring.
+1 |
ingester.autoforget_unhealthy would surely fix restarted pods; alternatively, Cortex pods could always register themselves with the same name to avoid this scenario.
ingester.autoforget_unhealthy would be amazing when deploying to AWS with spot instances, where ingesters get destroyed and spun up again. Exposing the Cortex ring status web interface to manually remove unhealthy ingesters is not practical, and it is a security concern.
@rafilkmp3 Thanks for your input on that... I'm using k8s for this and will switch the ingesters to a StatefulSet, which should fix this issue (forcing the pods into consistent names). The other approach was going to be a quick job that would query the endpoint and remove the unhealthy ingesters, but the StatefulSet approach feels much cleaner.
Nobody has coded one for Cortex, to my knowledge.
We tell you not to do this in the docs.
I would be happy to take a stab at writing it.
+1 for this feature. This is especially useful in the distributor ring - a distributor is totally safe to forget if it has been unhealthy for a long time (e.g. 2 days). In that case it is safe to assume it was an unclean shutdown and it will never come back. Another thing is that in the newest Cortex release we introduced a related setting.
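To make the requested behaviour concrete, here is a minimal Go sketch of an autoforget policy, loosely in the spirit of the Loki change referenced below (grafana/loki#3919). The names (`member`, `autoforget`) and the in-memory map are illustrative only; a real implementation would have to read and CAS the ring in the KV store (Consul/etcd) rather than mutate a local map.

```go
package main

import (
	"fmt"
	"time"
)

// member is a minimal view of a ring entry as it might be read from the KV store.
type member struct {
	lastHeartbeat time.Time
}

// autoforget removes entries whose heartbeat is older than forgetAfter, on
// the assumption that they were shut down uncleanly and will never return.
// It returns the IDs that were forgotten.
func autoforget(ring map[string]member, forgetAfter time.Duration, now time.Time) []string {
	var forgotten []string
	for id, m := range ring {
		if now.Sub(m.lastHeartbeat) > forgetAfter {
			delete(ring, id)
			forgotten = append(forgotten, id)
		}
	}
	return forgotten
}

func main() {
	now := time.Now()
	ring := map[string]member{
		"ingester-0": {lastHeartbeat: now},
		"ingester-1": {lastHeartbeat: now.Add(-3 * time.Hour)}, // left behind by an unclean shutdown
	}
	// Forget anything that has been unhealthy for more than an hour.
	for _, id := range autoforget(ring, time.Hour, now) {
		fmt.Println("forgot", id)
	}
}
```

The trade-off debated earlier in this thread applies: forgetting too eagerly risks losing unflushed data that a halted rollout would otherwise protect, so the threshold needs to be well beyond any plausible recovery time.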
Implementation adapted from grafana/loki#3919. Related to cortexproject#1521. Signed-off-by: Josh Carp <jm.carp@gmail.com>
How did you do this? Can you share your config?
Would be nice.
Is there any way to auto-forget unhealthy ring members in Cortex?
In a Kubernetes & Helm based scenario, these Helm values could be a workaround: ingester:
initContainers:
- name: cleanup-unhealthy-ingesters
image: alpine
command:
- sh
- -c
- 'apk add curl jq && curl -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring | jq ".shards[] | select(.state==\"UNHEALTHY\") | .id" | xargs -I{} curl -d "forget={}" -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring' Please be aware that you need to change the two urls in conformance to your Helm release name. Here it is |
We ended up adding these Kubernetes resources for an automatic cleanup of unhealthy ingesters:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-ingester-cleanup-script
  namespace: cortex
data:
  script: |
    while true; do
      which curl > /dev/null 2>&1
      if [ $? -eq 1 ]; then
        apk add curl
      fi
      which jq > /dev/null 2>&1
      if [ $? -eq 1 ]; then
        apk add jq
      fi
      curl -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring |
        jq ".shards[] | select(.state==\"Unhealthy\") | .id" |
        sed 's|"||g' |
        xargs -I{} curl -d "forget={}" -d 'csrf_token=$__CSRF_TOKEN_PLACEHOLDER__' -H "Accept: application/json" http://cortex-distributor:8080/ingester/ring
      sleep 3
    done
    true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cortex-ingester-cleanup
  namespace: cortex
  labels:
    app: cortex-ingester-cleanup
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cortex-ingester-cleanup
  template:
    metadata:
      labels:
        app: cortex-ingester-cleanup
        revision: '1'
    spec:
      containers:
        - name: cortex-ingester-cleanup
          image: alpine
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
          command:
            - sh
            - -c
            - "apk add bash && exec bash /cortex-ingester-cleanup.sh"
          volumeMounts:
            - name: cortex-ingester-cleanup-script
              mountPath: /cortex-ingester-cleanup.sh
              subPath: script
      volumes:
        - name: cortex-ingester-cleanup-script
          configMap:
            name: cortex-ingester-cleanup-script
```
I think the reason is that when 2 pods are terminated at the same time, then with the Ordered policy one pod will start first. That pod will be shown as ACTIVE in the ring, but on the k8s side it is not ready. I checked the log of that pod and it showed this log.
Got bitten by this terribly several times now, and lost a lot of time and data :-( - would really love to see something like ingester.autoforget_unhealthy.
Where do I find the value for the placeholder in the cleanup script above?
I have a similar problem to #1502.
When my ingester pod was evicted, a new ingester pod was created.
Now the ring has two ingesters, but only one (the new one) is healthy. The old one will not be removed from the ring, even if I delete the evicted pod manually.
The ring information is as follows:
`
`
and the ingester's status is always not ready, with the distributor logging this error:
level=warn ts=2019-07-19T03:41:45.413839063Z caller=server.go:1995 traceID=daf4028f530860f msg="POST /api/prom/push (500) 727.847µs Response: \"at least 1 live ingesters required, could only find 0\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 3742; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.11.0; X-Forwarded-For: 172.16.0.17; X-Forwarded-Host: perf.monitorefk.huawei.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Original-Uri: /api/prom/push; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.16.0.17; X-Request-Id: 62a470dc6de7a83c8974e3411fa63e40; X-Scheme: https; X-Scope-Orgid: custom; "
I wonder if there is any solution to deal with this situation automatically?
Maybe to check the replication factor and remove unhealthy excess ingesters from the ring?