
Receive's memory usage continues to grow in v0.31.0-rc.0 #6176

Open
yutian1224 opened this issue Mar 2, 2023 · 22 comments
Comments

@yutian1224

Thanos, Prometheus and Golang version used:
Thanos: v0.31.0-rc.0

Object Storage Provider:
S3

What happened:
I upgraded from 0.30.2 to the new version at around 8:00 and noticed that memory kept growing until I rolled back.
[screenshot: memory usage graph showing continuous growth after the upgrade until the rollback]

Args:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/var/thanos/receive
--label=thanosreplica="$(NAME)"
--label=receive="true"
--tsdb.retention=1d
--receive.local-endpoint=$(NAME).$(NAMESPACE).svc.cluster.local:10901
--receive.grpc-compression=snappy
--tsdb.out-of-order.time-window=1h
--store.limits.request-samples=1000
--store.limits.request-series=10000
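For completeness, the value of $(OBJSTORE_CONFIG) is not included above; a minimal sketch of what an S3 objstore config for Thanos typically looks like (all values below are placeholders, not taken from this setup):

type: S3
config:
  bucket: "<bucket-name>"     # placeholder
  endpoint: "<s3-endpoint>"   # placeholder
  access_key: "<access-key>"  # placeholder
  secret_key: "<secret-key>"  # placeholder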
@fpetkovski
Contributor

Would you mind posting a graph of head series and samples ingested for the same time period?

The metrics are prometheus_tsdb_head_series and prometheus_tsdb_head_samples_appended_total.
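In case it helps, a rough way to pull those two metrics via the Prometheus HTTP API (the <querier> host and the thanos-receive job label below are assumptions; adjust them to your environment):

# <querier> host/port and the job label are placeholders
curl -sG 'http://<querier>:9090/api/v1/query' \
  --data-urlencode 'query=sum(prometheus_tsdb_head_series{job="thanos-receive"})'
curl -sG 'http://<querier>:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(prometheus_tsdb_head_samples_appended_total{job="thanos-receive"}[5m]))'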

@yutian1224
Author

I'm afraid we didn't collect those metrics. The figure below shows the series_count_by_metricname data from /api/v1/status/tsdb, which does not show a big increase compared to 0.30.2.
[screenshot: series count by metric name from /api/v1/status/tsdb]
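For reference, that data can also be pulled directly with something like the following (the host is a placeholder, the field name follows the Prometheus TSDB status API, and jq is only used for readability):

# <receive-host> is a placeholder
curl -s 'http://<receive-host>:10902/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName'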

@fpetkovski
Contributor

@philipgough @saswatamcode would it be possible to test the RC with your load testing framework to see if there's a memory regression with Thanos itself?

@douglascamata
Contributor

@yutian1224 do you know if there might have been queries being executed in your cluster that could be touching the "hot data" in Receives?

You said you rolled back, but the right edge of the chart still shows a trend upwards. How's the memory usage since you rolled back?

@yutian1224
Author

@douglascamata
Our Receive is mainly queried by Grafana dashboards and alerting rules; I'm not sure whether the "hot data" here refers to the alerting part.
The alerting queries run at a fixed interval and are continuous.

As shown in the figure below, memory usage before upgrading to 0.31.0 and after rolling back was relatively stable.
[screenshot: memory usage, stable before the upgrade and after the rollback]

@philipgough
Contributor

@fpetkovski I won't get a chance to do so this week due to other commitments, but I can check next week.
What I can say now is that we are already running the RC in one of our production environments and are not seeing the issues reported here.

@yutian1224
Author

@philipgough I tested the instance without the --store.limits flags and the results were stable.
So I suspect the problem is caused by the --store.limits flags.

@philipgough
Contributor

@yutian1224 interesting, thanks for confirming. We were indeed running the RC without those limit flags.

cc @fpetkovski

@fpetkovski
Contributor

I can test later this week, thanks for looking into it.

@matej-g
Collaborator

matej-g commented Mar 17, 2023

Interesting. @fpetkovski, I assume it could then be #6074; I just realized we only added those flags in this release.

It would be interesting to see a profile showing where all the memory is being hogged.

@fpetkovski
Contributor

I enabled these flags in our staging environment but could not reproduce the described memory issue. @yutian1224 are you able to reproduce this problem consistently?

@yutian1224
Author

@fpetkovski Yes. Apart from the first occurrence, I also compared the behavior before and after removing the limit flags, and the problem reproduced again.
Could it be a problem with large amounts of data? At present, each Receive instance in our environment ingests about 3 million series.

@fpetkovski
Contributor

3M series should not be that much data. Would you mind providing a heap profile when you reproduce the issue? You can get it by hitting the /debug/pprof/heap endpoint on the receiver's HTTP port 10902.
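For example, something along these lines (the pod host is a placeholder):

# fetch the heap profile from the Receive HTTP port
curl -s 'http://<receive-pod>:10902/debug/pprof/heap' -o receive-heap.pprof
# inspect the top allocations, or open an interactive view in the browser
go tool pprof -top receive-heap.pprof
go tool pprof -http=:8080 receive-heap.pprof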

@yutian1224
Author

@fpetkovski Sure, I'll test it over the weekend.

@fpetkovski
Contributor

I enabled these two flags in staging and ran Receivers for about a day. I cannot reproduce the memory leak, so I think we can release 0.31.0 as it is now. Once we have the heap profile we can check whether the limits are the culprit and cut 0.31.1 if necessary.

[screenshot: staging Receiver memory usage over roughly one day]

@matej-g
Collaborator

matej-g commented Mar 22, 2023

Sounds good to me, thanks for checking @yutian1224 and @fpetkovski 👍

@yutian1224
Author

@fpetkovski I deployed 0.31.0 with the limit flags enabled yesterday, and the memory problem reappeared.
The attached zip contains the pprof heap profile, captured when memory usage was at about 48%.

[screenshot: memory usage at roughly 48% when the heap profile was captured]
receive.prof.zip

@douglascamata
Contributor

FYI I took the profile and uploaded it to this web visualization tool: https://flamegraph.com/share/7c78f5a0-cfa9-11ed-9b0d-d641223b6af4.

I'm not sure what the problem is, but github.com/thanos-io/thanos/pkg/receive.newReplicationErrors caught my attention: 7.6 GB of heap there. 🤔
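If it helps, one way to drill into that frame locally (assuming the extracted profile is named receive.prof):

# show only allocation paths that go through the suspicious function
go tool pprof -top -focus newReplicationErrors receive.prof
# or browse the whole profile interactively
go tool pprof -http=:8080 receive.prof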

@philipgough
Contributor

I wonder if there is some contention caused by these low read limits on receivers that is affecting the ingestion path.

@yutian1224 Can you confirm whether the limits were being hit? Can you increase your previous limits by 100x and see if the problem remains?
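For reference, a 100x bump over the values posted above would mean:

--store.limits.request-samples=100000
--store.limits.request-series=1000000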

@yutian1224
Author

@philipgough I am pretty sure the limits were being hit.
The right side of the figure shows the network traffic at that time. After adding the limits, you can see that outgoing traffic dropped significantly while incoming traffic did not change much.
[screenshot: network in/out traffic before and after adding the limits]

@fpetkovski
Contributor

This is really interesting. Is the yellow line outgoing traffic, and why is it negative?

@yutian1224
Author

@fpetkovski
For convenience, we display incoming and outgoing traffic in the same panel, with one direction plotted as positive and the other as negative. 😄
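For what it's worth, such a panel is usually built by negating one direction in the query, e.g. (the metric and label names below are assumptions, not taken from this setup):

# incoming traffic plotted as positive (metric/label names assumed)
sum(rate(container_network_receive_bytes_total{pod=~"thanos-receive.*"}[5m]))
# outgoing traffic plotted as negative
sum(rate(container_network_transmit_bytes_total{pod=~"thanos-receive.*"}[5m])) * -1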
