
Receive's memory usage continues to grow in v0.31.0-rc.0 #6176

Open
yutian1224 opened this issue Mar 2, 2023 · 22 comments
Comments

@yutian1224

Thanos, Prometheus and Golang version used:
Thanos: v0.31.0-rc.0

Object Storage Provider:
S3

What happened:
I upgraded from 0.30.2 to the new version at around 8:00 and noticed that memory kept growing until I rolled back.
[screenshot: memory usage graph showing continuous growth after the upgrade until the rollback]

Args:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/var/thanos/receive
--label=thanosreplica="$(NAME)"
--label=receive="true"
--tsdb.retention=1d
--receive.local-endpoint=$(NAME).$(NAMESPACE).svc.cluster.local:10901
--receive.grpc-compression=snappy
--tsdb.out-of-order.time-window=1h
--store.limits.request-samples=1000
--store.limits.request-series=10000
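For completeness, the value of $(OBJSTORE_CONFIG) is not included above; a minimal sketch of what an S3 objstore config for Thanos typically looks like (all values below are placeholders, not taken from this setup):

type: S3
config:
  bucket: "<bucket-name>"     # placeholder
  endpoint: "<s3-endpoint>"   # placeholder
  access_key: "<access-key>"  # placeholder
  secret_key: "<secret-key>"  # placeholder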
@fpetkovski
Contributor

Would you mind posting a graph of head series and samples ingested for the same time period?

The metrics are prometheus_tsdb_head_series and prometheus_tsdb_head_samples_appended_total.
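In case it helps, a rough way to pull those two metrics via the Prometheus HTTP API (the <querier> host and the thanos-receive job label below are assumptions; adjust them to your environment):

# <querier> host/port and the job label are placeholders
curl -sG 'http://<querier>:9090/api/v1/query' \
  --data-urlencode 'query=sum(prometheus_tsdb_head_series{job="thanos-receive"})'
curl -sG 'http://<querier>:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(prometheus_tsdb_head_samples_appended_total{job="thanos-receive"}[5m]))'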

@yutian1224
Author

I'm afraid we didn't collect those metrics. The figure below shows the series_count_by_metricname data from /api/v1/status/tsdb, which does not show a big increase compared to 0.30.2.
[screenshot: series count by metric name from /api/v1/status/tsdb]
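For reference, that data can also be pulled directly with something like the following (the host is a placeholder, the field name follows the Prometheus TSDB status API, and jq is only used for readability):

# <receive-host> is a placeholder
curl -s 'http://<receive-host>:10902/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName'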

@fpetkovski
Contributor

@philipgough @saswatamcode would it be possible to test the RC with your load testing framework to see if there's a memory regression with Thanos itself?

@douglascamata
Contributor

@yutian1224 do you know if there might have been queries being executed in your cluster that could be touching the "hot data" in Receives?

You said you rolled back, but the right edge of the chart still shows a trend upwards. How's the memory usage since you rolled back?

@yutian1224
Author

@douglascamata
Our Receive is mainly queried by Grafana dashboards and alerting rules; I'm not sure whether the "hot data" here refers to the alerting part.
The alerting queries run at a fixed interval and are continuous.

As shown in the figure below, memory usage before upgrading to 0.31.0 and after rolling back was relatively stable.
[screenshot: memory usage, stable before the upgrade and after the rollback]

@philipgough
Contributor

@fpetkovski I won't get a chance to do so this week due to other commitments, but I can check next week.
What I can say now is that we are already running the RC in one of our production environments and are not seeing the issues reported here.

@yutian1224
Author

@philipgough I tested the instance without the --store.limits flags and the results were stable.
So I suspect the problem is caused by the --store.limits flags.

@philipgough
Contributor

@yutian1224 interesting, thanks for confirming. We were indeed running the RC without those limit flags.

cc @fpetkovski

@fpetkovski
Contributor

I can test later this week, thanks for looking into it.

@matej-g
Collaborator

matej-g commented Mar 17, 2023

Interesting. @fpetkovski, I assume it could then be #6074; I just realized we only added those flags in this release.

It would be interesting to see a profile showing where all the memory is being hogged.

@fpetkovski
Contributor

I enabled these flags in our staging environment but could not reproduce the described memory issue. @yutian1224 are you able to reproduce this problem consistently?

@yutian1224
Author

@fpetkovski Yes. Apart from the first occurrence, I also compared the behavior before and after removing the limit flags, and the problem reproduced again.
Could it be a problem with large amounts of data? At present, each Receive instance in our environment ingests about 3 million series.

@fpetkovski
Contributor

3M series should not be that much data. Would you mind providing a heap profile when you reproduce the issue? You can get it by hitting the /debug/pprof/heap endpoint on the receiver's HTTP port 10902.
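For example, something along these lines (the pod host is a placeholder):

# fetch the heap profile from the Receive HTTP port
curl -s 'http://<receive-pod>:10902/debug/pprof/heap' -o receive-heap.pprof
# inspect the top allocations, or open an interactive view in the browser
go tool pprof -top receive-heap.pprof
go tool pprof -http=:8080 receive-heap.pprof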

@yutian1224
Author

@fpetkovski Sure, I'll test it over the weekend.

@fpetkovski
Contributor

I enabled these two flags in staging and ran Receivers for about a day. I cannot reproduce the memory leak, so I think we can release 0.31.0 as it is now. Once we have the heap profile we can check whether the limits are the culprit and cut 0.31.1 if necessary.

[screenshot: staging Receiver memory usage over roughly one day]

@matej-g
Collaborator

matej-g commented Mar 22, 2023

Sounds good to me, thanks for checking @yutian1224 and @fpetkovski 👍

@yutian1224
Author

@fpetkovski I deployed 0.31.0 with the limit flags enabled yesterday, and the memory problem reappeared.
The attached zip contains the pprof heap profile, captured when memory usage was at about 48%.

[screenshot: memory usage at roughly 48% when the heap profile was captured]
receive.prof.zip

@douglascamata
Contributor

FYI I took the profile and uploaded it to this web visualization tool: https://flamegraph.com/share/7c78f5a0-cfa9-11ed-9b0d-d641223b6af4.

I'm not sure what the problem is, but github.com/thanos-io/thanos/pkg/receive.newReplicationErrors caught my attention: 7.6 GB of heap there. 🤔
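If it helps, one way to drill into that frame locally (assuming the extracted profile is named receive.prof):

# show only allocation paths that go through the suspicious function
go tool pprof -top -focus newReplicationErrors receive.prof
# or browse the whole profile interactively
go tool pprof -http=:8080 receive.prof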

@philipgough
Contributor

I wonder if there is some contention caused by these low read limits on receivers that is affecting the ingestion path.

@yutian1224 Can you confirm whether the limits were being hit? Can you increase your previous limits by 100x and see if the problem remains?
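For reference, a 100x bump over the values posted above would mean:

--store.limits.request-samples=100000
--store.limits.request-series=1000000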

@yutian1224
Author

@philipgough I am pretty sure the limits were being hit.
The right side of the figure shows the network traffic at that time. After adding the limits, you can see that outgoing traffic dropped significantly while incoming traffic did not change much.
[screenshot: network in/out traffic before and after adding the limits]

@fpetkovski
Contributor

This is really interesting. Is the yellow line outgoing traffic, and why is it negative?

@yutian1224
Author

@fpetkovski
For convenience, we display incoming and outgoing traffic in the same panel, with one direction plotted as positive and the other as negative. 😄
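For what it's worth, such a panel is usually built by negating one direction in the query, e.g. (the metric and label names below are assumptions, not taken from this setup):

# incoming traffic plotted as positive (metric/label names assumed)
sum(rate(container_network_receive_bytes_total{pod=~"thanos-receive.*"}[5m]))
# outgoing traffic plotted as negative
sum(rate(container_network_transmit_bytes_total{pod=~"thanos-receive.*"}[5m])) * -1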
