Ingesters not passing readiness probe #1502

Closed
pablokbs opened this issue Jul 10, 2019 · 7 comments

@pablokbs

I have 5 ingesters running and none of them are passing the readiness probe (/ready). I exec'ed into the pod and hit the endpoint manually, and I'm getting a 503.
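To make that concrete, the manual check was roughly the following sketch (the pod name is taken from the logs below; port 80 and the availability of curl in the ingester image are assumptions, the port being whatever -server.http-listen-port is set to):

```console
# Hit the ingester's readiness endpoint from inside the pod.
# Port 80 is an assumption (-server.http-listen-port), and this assumes
# curl exists in the image.
$ kubectl exec ingester-654d497d6c-kk48q -- \
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:80/ready
503
```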

The logs are only showing a memcached error (#1501), and some of the other pods are failing with:

nginx-765859c647-ghtcz nginx 172.27.86.136 - - [10/Jul/2019:13:41:47 +0000]  499 "POST /api/prom/push HTTP/1.1" 0 "-" "Prometheus/2.8.0" "-"
ingester-654d497d6c-kk48q ingester level=debug ts=2019-07-10T13:41:50.775600196Z caller=logging.go:44 traceID=21b65bc91639534 msg="GET /metrics (200) 2.290994ms"
querier-8695bc98f-mm85n querier level=error ts=2019-07-10T13:41:51.332976103Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
alertmanager-64b87454ff-lkh8w alertmanager level=debug ts=2019-07-10T13:41:52.05485785Z caller=multitenant.go:367 msg="adding configurations" num_configs=0
query-frontend-5cb5894767-5xzmf query-frontend level=debug ts=2019-07-10T13:41:52.748532948Z caller=logging.go:44 traceID=6b430eaabd84c74a msg="GET /metrics (200) 2.593358ms"
distributor-7c6b454b8f-f556c distributor level=error ts=2019-07-10T13:41:53.366709337Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
ruler-5ddb4b6fdf-pqmhj ruler level=error ts=2019-07-10T13:41:53.617524174Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
ruler-5ddb4b6fdf-pqmhj ruler level=debug ts=2019-07-10T13:41:53.624633426Z caller=scheduler.go:215 msg="adding configurations" num_configs=0
ingester-69579f7456-tll7r ingester level=debug ts=2019-07-10T13:41:57.262713983Z caller=logging.go:44 traceID=1d414319391a5ca2 msg="GET /ready (503) 1.648503ms"

What can I do to debug this issue in the ingesters?

@bboreham
Contributor

Cortex itself doesn't care about readiness. If you visit the /ring page on a distributor in a browser, you should see the state as Cortex sees it.
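For example, one way to reach that page is to port-forward to a distributor (the deployment name is inferred from the pod names in the logs above; the container HTTP port of 80 is an assumption):

```console
# Forward a local port to a distributor; 80 is an assumption for the
# container's HTTP port (-server.http-listen-port).
$ kubectl port-forward deployment/distributor 8080:80
# Then open http://localhost:8080/ring in a browser to see each ingester's
# state in the ring (this is also where the "Forget" action mentioned
# below lives).
```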

@pablokbs
Author

Nice, that helped. I was able to "Forget" the pods that were unhealthy, and now I'm up and running. Is there a way to automatically forget those unhealthy pods?

@bboreham
Contributor

Right now we require human input, because it's hard to decide what really happened: was it a bug in the ingester, data corruption in the ring, etc.

@pablokbs
Author

I see. Are there any plans to automate this process?

@bboreham
Contributor

Can’t really plan around the absence of knowledge.

For instance, can you say why you had unhealthy pods?

@pablokbs
Author

Is there somewhere I can look for that info? There is nothing in the logs of any service. I do know that I was replacing pods (changing env vars or requests/limits on the deployment manifest), and that rollout is what caused this. If I replace pods, do I need to manually fix the cluster through the /ring interface?

@bboreham
Contributor

When you shut down an ingester it needs to hand over its chunks to a new one or flush them all to the store, which can take many minutes. After that it will remove itself from the ring.

Under Kubernetes you will need a sufficiently large grace period on the pod definition.
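A sketch of what that looks like on the ingester Deployment (the 2400-second value is purely illustrative; it just needs to exceed a worst-case handover or flush):

```yaml
# Fragment of the ingester Deployment's pod template. 2400s is an
# illustrative value, not a recommendation -- size it to comfortably
# cover the time a handover or full flush to the store can take.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 2400
```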

Also, #1307 means you need to raise -ingester.max-transfer-retries.
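For example, as an extra flag on the ingester container (the value 60 is illustrative; only the flag name comes from the comment above):

```yaml
# Fragment of the ingester container spec: keep your existing args and
# add the retry flag. The retry count shown is illustrative.
containers:
  - name: ingester
    args:
      - -ingester.max-transfer-retries=60
```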
