Distributors OOM on a single slow ingester in the cluster #1895
Comments
I think this might be caused by #736 - we used to cancel the outstanding request; now we pile them up, as you said.
Yes, if you cancel the 3rd push every time then each ingester will have a random sprinkling of holes in the data, so the checksums won't match.
BTW I just added a link to a blog post in #1578 that describes the efficiency gains.
Thanks. I just wanted to make sure I understand it correctly, as I was adding a similar thing to Loki earlier today and hope to see similar benefits. (Loki already uses the #736 change, so all is good there.)
Today we ran into the same issue, which caused an outage of the write path in our prod environment.
I am unsure why the Cortex ingester was slow at all, but I noticed it was always the same ingester. I did not see any sign of the underlying Kubernetes node being faulty, but I resolved the issue by draining that node so that a new ingester would start. The problematic ingester failed to leave the ring, so I also had to manually forget it. Since then the cluster seems to be stable again.
We could count the number of in-flight requests to ingesters and fail the incoming request with a 5xx response when that number goes over a threshold. This would prevent OOMs on the distributor.
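A minimal Go sketch of that idea, assuming a plain HTTP middleware wrapped around the push handler; the `inflightLimit` value, the `/api/v1/push` wiring, and the handler names are illustrative, not Cortex's actual implementation:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// inflightLimit is a hypothetical threshold, not a real Cortex default.
const inflightLimit = 2000

var inflight int64

// limitInflight counts in-flight push requests and fails fast with a 5xx
// once the count exceeds the threshold, instead of letting requests pile
// up in the distributor's memory.
func limitInflight(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inflight, 1) > inflightLimit {
			atomic.AddInt64(&inflight, -1)
			http.Error(w, "too many in-flight push requests", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inflight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/push", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // stand-in for the real Push handler
	})
	http.ListenAndServe(":8080", limitInflight(mux))
}
```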
Slightly more sophisticated: count the number of requests in-flight per ingester. If one of them is over a threshold, treat that ingester as unhealthy and spill the samples to the next one. Thus we don't 500 back to the caller unless nearly all ingesters are impacted.
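A sketch of the per-ingester variant, again with made-up names (`inflightTracker`, `perIngesterLimit`) rather than Cortex's real code; it only illustrates how an overloaded ingester could be skipped in favour of the next replica:

```go
package main

import (
	"fmt"
	"sync"
)

// perIngesterLimit is a hypothetical threshold, not a real Cortex setting.
const perIngesterLimit = 500

// inflightTracker counts in-flight pushes per ingester so an ingester that
// is not keeping up can be treated as unhealthy and skipped.
type inflightTracker struct {
	mu       sync.Mutex
	inflight map[string]int // ingester address -> in-flight push count
}

func newInflightTracker() *inflightTracker {
	return &inflightTracker{inflight: map[string]int{}}
}

// acquire reports whether the ingester is below the threshold; if so it
// records one more in-flight request. Callers must call release when done.
func (t *inflightTracker) acquire(addr string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inflight[addr] >= perIngesterLimit {
		return false // over the threshold: spill to the next replica
	}
	t.inflight[addr]++
	return true
}

func (t *inflightTracker) release(addr string) {
	t.mu.Lock()
	t.inflight[addr]--
	t.mu.Unlock()
}

func main() {
	t := newInflightTracker()
	// Ordered replica set for a series; the names are made up.
	replicas := []string{"ingester-1", "ingester-2", "ingester-3", "ingester-4"}
	for _, addr := range replicas {
		if t.acquire(addr) {
			fmt.Println("pushing to", addr)
			// ... send the push, then call t.release(addr) when it completes.
			break
		}
		fmt.Println("skipping overloaded ingester", addr)
	}
}
```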
+1
Is there any fix for this bug?
We added Note that
It's very likely that the context for this issue is that a 20s timeout was used instead of the default 2s. I experienced the same issues for years, using 20s as the remote-timeout too; the problem was gone when the timeout was reduced back to 2s.
Bryan was right.
Fixed in cortexproject/cortex-jsonnet#15
Yesterday all distributors got continuously `OOMKilled` in one of our Cortex clusters. The root cause analysis showed the issue was caused by a single ingester running on a failing Kubernetes node, which was up but very slow.

This issue is due to how the quorum works. When the distributors receive a `Push()` request, the time series are sharded and then sent to 3 ingesters (we have a replication factor of `3`). The distributor's `Push()` request completes as soon as all series are pushed to at least 2 ingesters.

In the case of a very slow ingester, the distributor piles up in-flight requests towards the slow ingester, while the inbound `Push()` request completes as soon as the other ingesters successfully finish the ingestion. This causes the memory used by the distributors to grow due to the in-flight requests towards the slow ingester.

In a high-traffic Cortex cluster, distributors can hit their memory limit before the timeout of the in-flight requests towards the slow ingester expires, causing all distributors to be `OOMKilled` (and subsequent distributor restarts will OOM again until the very slow ingester is removed from the ring).