Distributors OOM on a single slow ingester in the cluster #1895

Closed
pracucci opened this issue Dec 10, 2019 · 14 comments

@pracucci
Contributor

Yesterday all distributors in one of our Cortex clusters were continuously OOMKilled. The root cause analysis showed the issue was caused by a single ingester running on a failing Kubernetes node: the node was up, but very slow.

This issue is due to how the quorum works. When a distributor receives a Push() request, the time series are sharded and each series is sent to 3 ingesters (we have a replication factor of 3). The distributor's Push() request completes as soon as all series are pushed to at least 2 ingesters.

In the case of a very slow ingester, the distributor piles up in-flight requests towards that ingester, while the inbound Push() request completes as soon as the other ingesters successfully complete the ingestion.

This causes the memory used by the distributors to increase due to the in-flight requests towards the slow ingester.

In a high-traffic Cortex cluster, distributors can hit the memory limit before the timeout of the in-flight requests towards the slow ingester expires, causing all distributors to be OOMKilled (and subsequent distributor restarts will OOM again until the very slow ingester is removed from the ring).
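
For illustration, here is a minimal Go sketch of the quorum behaviour described above. This is not the actual Cortex code: `pushWithQuorum`, `pushToIngester` and the hard-coded quorum of 2-out-of-3 are assumptions for the example. The point is that the call returns as soon as 2 replicas succeed, while the request to a very slow third replica keeps running, and holding memory, in the background.

```go
// Minimal sketch of the quorum push described above (not the actual Cortex
// code). pushToIngester is a hypothetical per-ingester push function.
package sketch

import (
	"context"
	"errors"
)

func pushWithQuorum(ctx context.Context, ingesters []string, series []byte,
	pushToIngester func(context.Context, string, []byte) error) error {

	results := make(chan error, len(ingesters)) // buffered: late replies don't block
	for _, ing := range ingesters {
		go func(ing string) {
			// The context is NOT cancelled once quorum is reached (see #736),
			// so a push to a very slow ingester stays in flight, holding
			// memory, until its own timeout expires.
			results <- pushToIngester(ctx, ing, series)
		}(ing)
	}

	succeeded, failed := 0, 0
	for range ingesters {
		if err := <-results; err != nil {
			failed++
		} else {
			succeeded++
		}
		if succeeded >= 2 { // quorum reached with replication factor 3
			return nil // the slow 3rd push is still running in the background
		}
		if failed >= 2 { // quorum can no longer be reached
			return errors.New("quorum not reached")
		}
	}
	return errors.New("quorum not reached")
}
```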

@tomwilkie
Contributor

I think this might be caused by #736 - we used to cancel the outstanding requests; now we pile them up, as you said.

@bboreham
Contributor

#858 talks about a similar situation - we need to limit the number of "backgrounded" requests.

It sounds like a smaller timeout would help.

#736 was done for good reasons, and is essential to the efficiency gain from #1578.

@pstibrany
Contributor

#736 was done for good reasons, and is essential to the efficiency gain from #1578.

Can you please elaborate on how #1578 is related to #736? Is it to make sure that each ingester gets exactly the same data, and not only part of it due to the parent context being cancelled or timing out?

@bboreham
Contributor

Yes, if you cancel the 3rd push every time then each ingester will have a random sprinkling of holes in the data, so the checksums won't match.

@bboreham
Contributor

BTW I just added a link to a blog post in #1578 that describes the efficiency gains.

@pstibrany
Contributor

pstibrany commented Dec 18, 2019

BTW I just added a link to a blog post in #1578 that describes the efficiency gains.

Thanks. I just wanted to make sure I understand it correctly, as I was adding a similar thing to Loki earlier today and hope to see similar benefits. (Loki already uses the #736 change, so all is good there.)

@weeco
Contributor

weeco commented Feb 25, 2020

Today we ran into the same issue which caused an outage of the write path in our prod environment.

  • At first the CPU usage of a single ingester jumped from the expected 25-40% to 80-100%.
  • At the same time its RAM usage ramped up to the point where the ingester eventually got OOM killed (under normal conditions the ingester uses 25% of the available RAM). It took just 5 minutes to exceed the RAM limits.
  • Even before the ingester was eventually OOM killed, the first distributor pods had already been OOM killed. Eventually all distributors were constantly restarting because of OOM kills.

I am unsure why the Cortex ingester was slow at all, but I noticed it was always the same ingester. I did not see any sign of the underlying Kubernetes node being faulty, but I resolved the issue by draining that node so that a new ingester would start. The problematic ingester failed to leave the ring, so I also had to manually forget it. Since then the cluster has been stable again.

@bboreham
Contributor

We could count the number of in-flight requests to ingesters and fail (response 5xx) the incoming request when that number goes over a threshold. This would prevent OOM on the distributor.
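
A minimal sketch of that idea, assuming a hypothetical HTTP push handler; the threshold value and wiring are illustrative, not Cortex's implementation:

```go
// Sketch: reject pushes with a 5xx once too many are in flight, so the
// distributor sheds load instead of growing its memory until it is OOM killed.
package sketch

import (
	"net/http"
	"sync/atomic"
)

const maxInflightPushRequests = 5000 // illustrative threshold

var inflightPushRequests int64

func limitInflight(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inflightPushRequests, 1) > maxInflightPushRequests {
			atomic.AddInt64(&inflightPushRequests, -1)
			http.Error(w, "too many in-flight push requests", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inflightPushRequests, -1)
		next.ServeHTTP(w, r)
	})
}
```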

@bboreham
Contributor

Slightly more sophisticated:

Count the number of requests in-flight per ingester. If one of them is over a threshold, treat that ingester as unhealthy and spill the samples to the next one. Thus we don’t 500 back to the caller unless nearly all ingesters are impacted.
Also we can expose the per-ingester counts as metrics.
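
A sketch of the per-ingester variant, with the same caveat that names, metric name, and numbers here are illustrative rather than Cortex's actual implementation:

```go
// Sketch: per-ingester in-flight gauge. acquire returns false when an
// ingester is over the limit, so the caller can treat it as unhealthy and
// spill the series to the next ingester in the ring instead of failing the
// whole request. The counts are also exported as a Prometheus metric.
package sketch

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

const perIngesterInflightLimit = 200 // illustrative threshold

var inflightPerIngester = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "distributor_ingester_inflight_requests", // illustrative metric name
	Help: "Current number of in-flight push requests per ingester.",
}, []string{"ingester"})

func init() {
	prometheus.MustRegister(inflightPerIngester)
}

type inflightTracker struct {
	mu     sync.Mutex
	counts map[string]int
}

func newInflightTracker() *inflightTracker {
	return &inflightTracker{counts: map[string]int{}}
}

func (t *inflightTracker) acquire(ingester string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.counts[ingester] >= perIngesterInflightLimit {
		return false // treat this ingester as unhealthy for this push
	}
	t.counts[ingester]++
	inflightPerIngester.WithLabelValues(ingester).Inc()
	return true
}

func (t *inflightTracker) release(ingester string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[ingester]--
	inflightPerIngester.WithLabelValues(ingester).Dec()
}
```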

@jakirpatel

+1

@jakirpatel

Is there any fix for this bug?

@bboreham
Contributor

We added -distributor.instance-limits.max-inflight-push-requests and -ingester.instance-limits.max-inflight-push-requests in 1.9.0.

Note that -distributor.instance-limits.max-inflight-push-requests does not address this problem on its own, because it decrements the counter after 2 responses have been received; the 3rd is still active but not counted.
But I think setting -ingester.max-concurrent-streams will prevent new calls from starting, so all three together should work as a fix.
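
For reference, an example of wiring those three limits together. The flag names are the ones mentioned in this comment; the values are placeholders to tune per cluster, not recommendations:

```
# Placeholder values, tune per cluster; flag names as mentioned above.
-distributor.instance-limits.max-inflight-push-requests=2000
-ingester.instance-limits.max-inflight-push-requests=2000
-ingester.max-concurrent-streams=1000
```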

@friedrichg
Member

It's very likely that the context for this issue is that a 20s timeout was used, instead of the default 2s
https://github.com/cortexproject/cortex-jsonnet/blob/3ff1d4cfcbfa28de1b83c33d42d74749e4c9c97b/cortex/distributor.libsonnet#L16

I experienced the same issues for years using 20s as the remote timeout too; the problem was gone when the timeout was reduced back to 2s.

It sounds like a smaller timeout would help.

Bryan was right
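
In flag form, the change being described is roughly the following, assuming the flag in question is the distributor's remote timeout (2s is the default mentioned above):

```
# Remove the 20s override set in the jsonnet linked above, falling back to
# the 2s default (or set it explicitly):
-distributor.remote-timeout=2s
```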

@friedrichg
Member

Fixed in cortexproject/cortex-jsonnet#15.
