Ingesters not passing readiness probe #1502

Closed
pablokbs opened this issue Jul 10, 2019 · 7 comments

@pablokbs

I have 5 ingesters running and none of them are passing the readiness probe (/ready). I exec'ed into the pod and hit the endpoint manually, and I'm getting a 503.
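To make that concrete, the manual check was roughly the following sketch (the pod name is taken from the logs below; port 80 and the availability of curl in the ingester image are assumptions, the port being whatever -server.http-listen-port is set to):

```console
# Hit the ingester's readiness endpoint from inside the pod.
# Port 80 is an assumption (-server.http-listen-port), and this assumes
# curl exists in the image.
$ kubectl exec ingester-654d497d6c-kk48q -- \
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:80/ready
503
```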

The logs are only showing a memcached error (#1501), and some of the other pods are failing with:

nginx-765859c647-ghtcz nginx 172.27.86.136 - - [10/Jul/2019:13:41:47 +0000]  499 "POST /api/prom/push HTTP/1.1" 0 "-" "Prometheus/2.8.0" "-"
ingester-654d497d6c-kk48q ingester level=debug ts=2019-07-10T13:41:50.775600196Z caller=logging.go:44 traceID=21b65bc91639534 msg="GET /metrics (200) 2.290994ms"
querier-8695bc98f-mm85n querier level=error ts=2019-07-10T13:41:51.332976103Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
alertmanager-64b87454ff-lkh8w alertmanager level=debug ts=2019-07-10T13:41:52.05485785Z caller=multitenant.go:367 msg="adding configurations" num_configs=0
query-frontend-5cb5894767-5xzmf query-frontend level=debug ts=2019-07-10T13:41:52.748532948Z caller=logging.go:44 traceID=6b430eaabd84c74a msg="GET /metrics (200) 2.593358ms"
distributor-7c6b454b8f-f556c distributor level=error ts=2019-07-10T13:41:53.366709337Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
ruler-5ddb4b6fdf-pqmhj ruler level=error ts=2019-07-10T13:41:53.617524174Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
ruler-5ddb4b6fdf-pqmhj ruler level=debug ts=2019-07-10T13:41:53.624633426Z caller=scheduler.go:215 msg="adding configurations" num_configs=0
ingester-69579f7456-tll7r ingester level=debug ts=2019-07-10T13:41:57.262713983Z caller=logging.go:44 traceID=1d414319391a5ca2 msg="GET /ready (503) 1.648503ms"

What can I do to debug this issue in the ingesters?

@bboreham
Contributor

Cortex itself doesn't care about readiness. If you visit the /ring page on a distributor in a browser, you should see the state as Cortex sees it.
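For example, one way to reach that page is to port-forward to a distributor (the deployment name is inferred from the pod names in the logs above; the container HTTP port of 80 is an assumption):

```console
# Forward a local port to a distributor; 80 is an assumption for the
# container's HTTP port (-server.http-listen-port).
$ kubectl port-forward deployment/distributor 8080:80
# Then open http://localhost:8080/ring in a browser to see each ingester's
# state in the ring (this is also where the "Forget" action mentioned
# below lives).
```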

@pablokbs
Author

Nice, that helped. I was able to "Forget" the pods that were unhealthy, and now I'm up and running. Is there a way to automatically forget those unhealthy pods?

@bboreham
Contributor

Right now we require human input, because it's hard to decide what really happened: was it a bug in the ingester, data corruption in the ring, etc.

@pablokbs
Author

I see. Are there any plans to automate this process?

@bboreham
Contributor

Can’t really plan around the absence of knowledge.

For instance, can you say why you had unhealthy pods?

@pablokbs
Author

Is there somewhere I can look for that info? There is nothing in the logs of any service. I do know that I was replacing pods (changing env vars or requests/limits on the deployment manifest), and that rollout is what caused this. If I replace pods, do I need to manually fix the cluster through the /ring interface?

@bboreham
Contributor

When you shut down an ingester it needs to hand over its chunks to a new one or flush them all to the store, which can take many minutes. After that it will remove itself from the ring.

Under Kubernetes you will need a sufficiently large grace period on the pod definition.
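A sketch of what that looks like on the ingester Deployment (the 2400-second value is purely illustrative; it just needs to exceed a worst-case handover or flush):

```yaml
# Fragment of the ingester Deployment's pod template. 2400s is an
# illustrative value, not a recommendation -- size it to comfortably
# cover the time a handover or full flush to the store can take.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 2400
```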

Also, #1307 means you need to raise -ingester.max-transfer-retries.
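For example, as an extra flag on the ingester container (the value 60 is illustrative; only the flag name comes from the comment above):

```yaml
# Fragment of the ingester container spec: keep your existing args and
# add the retry flag. The retry count shown is illustrative.
containers:
  - name: ingester
    args:
      - -ingester.max-transfer-retries=60
```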
