
Huge memory usage by retention policy #10453

Closed
aslobodskoy-fiksu opened this issue Nov 5, 2018 · 16 comments
@aslobodskoy-fiksu

System info: Official docker image 1.6.4-alpine

Steps to reproduce:

  1. Run retention on a huge DB
  2. Check memory consumption

Actual behavior: Huge memory consumption
Additional info:
[screenshot: memory usage graph]
PS
With 1.6.3 we saw slightly different behaviour: once per week (shard group duration of 7 days), it ran extremely long (2-3 hours), continuously consumed memory, and crashed because of OOM.
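
For anyone reproducing this, the retention policy and shard group duration involved can be checked from the influx CLI (inside the container for the Docker image); a minimal sketch, where the database name "mydb" is a placeholder:

# show retention policies (duration, shard group duration) for a hypothetical database "mydb"
influx -execute 'SHOW RETENTION POLICIES ON "mydb"'
# list shards and their expiry times
influx -execute 'SHOW SHARDS'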

@e-dard
Contributor

e-dard commented Nov 15, 2018

@aslobodskoy-fiksu which index are you using? inmem or tsi1? If you're using inmem I recommend you upgrade to tsi1 and then see how your heap looks. Removing shards with tsi1 is much cheaper.
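
A rough sketch of the usual inmem to tsi1 conversion, assuming the stock paths (/etc/influxdb/influxdb.conf, /var/lib/influxdb); the Docker image may mount these elsewhere:

# 1. stop influxd, then set the index version under [data] in /etc/influxdb/influxdb.conf:
#      index-version = "tsi1"
# 2. build tsi1 index files from the existing data, running as the influxdb user:
sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
# 3. restart influxd; shards are then served by the disk-based tsi1 index.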

@aslobodskoy-fiksu
Author

@e-dard I don't see such huge spikes anymore, but overall memory consumption still doesn't look great.
[screenshot: memory usage graph]
I'm going to try tsi1.

@aslobodskoy-fiksu
Author

I've switched InfluxDB to tsi1. The spikes are not as high, but the general trend of memory consumption still doesn't look right.
[screenshot: memory usage graph]

@dgnorton dgnorton added the 1.x label Jan 7, 2019
@persberry

Hi,

I tried to troubleshoot the same issue with the retention policy. I switched to tsi1 (and removed everything from the DB before that) and now have very similar symptoms: memory usage keeps increasing.

I'll be very grateful for any relevant suggestion.

InfluxDB 1.7.3 is running in Docker Swarm as a service. Data is stored on an AWS EBS gp2 volume.
Unix RSS typically looks like this (taken from /sys/fs/cgroup/memory/memory.stat):
[screenshot: RSS over time]
At the time when RSS = 3.2 GB,

pprof:
Count Profile
3196 allocs
43 block
0 cmdline
64 goroutine
3196 heap
35 mutex
0 profile
15 threadcreate
0 trace
Heap and allocs are always increasing.

pprof heap top:
Showing nodes accounting for 1202.09MB, 97.66% of 1230.92MB total
Dropped 76 nodes (cum <= 6.15MB)
Showing top 10 nodes out of 66
flat flat% sum% cum cum%
937.03MB 76.12% 76.12% 937.03MB 76.12% github.com/influxdata/influxdb/pkg/pool.(*LimitedBytes).Get
54.05MB 4.39% 80.52% 54.05MB 4.39% github.com/influxdata/influxdb/pkg/rhh.assign
48.19MB 3.91% 84.43% 48.19MB 3.91% bytes.makeSlice
40.45MB 3.29% 87.72% 58.95MB 4.79% github.com/influxdata/influxdb/tsdb/index/tsi1.(*LogFile).execSeriesEntry
40.33MB 3.28% 90.99% 66.11MB 5.37% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*partition).write
Note: the (*LimitedBytes).Get always increases.

pprof allocs top:
Showing nodes accounting for 157.39GB, 83.99% of 187.39GB total
Dropped 700 nodes (cum <= 0.94GB)
Showing top 10 nodes out of 84
flat flat% sum% cum cum%
56.02GB 29.90% 29.90% 56.02GB 29.90% bytes.makeSlice
24.58GB 13.12% 43.01% 64.33GB 34.33% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints
17.46GB 9.32% 52.33% 19.04GB 10.16% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*partition).write
16.97GB 9.06% 61.39% 36.01GB 19.22% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).WriteMulti
15.87GB 8.47% 69.86% 15.87GB 8.47% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*partition).keys
15.45GB 8.25% 78.11% 15.45GB 8.25% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*ring).apply.func1

Logs are all the same:
lvl=info msg="Cache snapshot (end)" log_id=0DA_XgDG000 engine=tsm1 trace_id=0DAhDdJW000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=292.107ms
lvl=info msg="Snapshot for path written" log_id=0DA_XgDG000 engine=tsm1 trace_id=0DAhDdJW000 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/data/db/db_ret_policy/3 duration=292.083ms
lvl=info msg="Removing WAL file" log_id=0DA_XgDG000 engine=tsm1 service=wal path=/var/lib/influxdb/wal/db/db_ret_policy/3/_18950.wal
lvl=info msg="Snapshot for path deduplicated" log_id=0DA_XgDG000 engine=tsm1 path=/var/lib/influxdb/data/db/db_ret_policy/3 duration=28.453ms

It looks like the issue is in func (p *LimitedBytes) Get(sz int), which makes no sense, as it's part of a utility package.
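
For reference, these profiles can be pulled from the /debug/pprof endpoint that influxd exposes on its HTTP port (8086 by default, assuming pprof is enabled in the config); a minimal sketch:

# list the available profiles (the Count/Profile table above)
curl -s http://localhost:8086/debug/pprof/
# summarize in-use heap memory and cumulative allocations with the Go toolchain
go tool pprof -top http://localhost:8086/debug/pprof/heap
go tool pprof -top -sample_index=alloc_space http://localhost:8086/debug/pprof/allocs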

@stale

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2019
@stale

stale bot commented Jul 30, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 30, 2019
@Dileep-Dora

Did this issue get fixed?
We're facing issues with InfluxDB 1.7.2, where the retention check interval is 30 minutes. After about 30 minutes, memory climbs and InfluxDB dies because of OOM.
[screenshot: memory usage graph]

Total memory is 30 GB.
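
The 30-minute cadence usually comes from the retention enforcement settings in influxdb.conf; a minimal sketch of the relevant section (default path assumed):

# show the retention service settings
grep -A 3 '^\[retention\]' /etc/influxdb/influxdb.conf
#   [retention]
#     enabled = true
#     check-interval = "30m"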

@e-dard
Contributor

e-dard commented Sep 25, 2019

@Dileep-Dora my suspicion is that you're running the inmem (default) index. In that case, when a shard is dropped, the index needs to be interrogated for entries that might need to be removed, which could be causing your heap issue. If your cardinality is high (in the millions), consider converting to the TSI index, which makes dropping shards much cheaper.
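
A quick way to gauge whether cardinality is in that range (standard InfluxQL in 1.7; the database name "mydb" is a placeholder):

# estimated series count for a hypothetical database "mydb"
influx -execute 'SHOW SERIES CARDINALITY ON "mydb"'
# exact (slower) variant
influx -execute 'SHOW SERIES EXACT CARDINALITY ON "mydb"'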

@lobocobra

lobocobra commented Nov 11, 2019

InfluxDB is crashing my VMM server every 6-24 h. I suspected that memory was running out, and that seems to be the case. This has been happening since I made the bad decision to upgrade to v1.7.x; the further I upgrade, the shorter it runs before failing.

So I changed the config file from TSM to TSI and ran the following commands:

  1. shutdown influxdb
  2. sudo -H -u influxdb bash -c influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
     .... btw, the above command was actually run with the right quotes, like $()

After 5 h not a single index folder had been created. I checked with:
find /var/lib/influxdb -type d -name index
=> NOTHING found
I just wanted an alternative to RRD that works, but now I have unstable home automation because of InfluxDB.
=> Anyone have a hint? Should I downgrade to 1.6.4? If yes, HOW?
=> Should I increase the shard duration? If yes, HOW?
=> Should I move from TSM to TSI? ... well, guess... HOW?

I know you guys put a lot of effort into this DB, but I want to avoid becoming an InfluxDB expert before it works without OOM.
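
For anyone retrying this, a sketch of the rebuild with explicit quoting, stopping the service first (paths assume the stock Debian layout; adjust as needed):

sudo systemctl stop influxdb
sudo -H -u influxdb bash -c 'influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal'
# once it finishes, per-shard index directories should appear:
find /var/lib/influxdb/data -type d -name index
# index-version = "tsi1" must also be set under [data] in influxdb.conf (see the conversion sketch above)
sudo systemctl start influxdb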

@e-dard
Contributor

e-dard commented Nov 12, 2019

@lobocobra sorry to hear you're having problems. Those sorts of questions seem like they would be better answered on the community forum, rather than the issue tracker. https://community.influxdata.com/

@lobocobra

lobocobra commented Nov 12, 2019

Thanks for the response. I understand...
... but looking at the number of people who report an OOM situation, I wonder if this is not a bug?

I isolated the issue by moving InfluxDB to a new virtual server... guess what, the server crashed after a while.

In 1.6.4 I never had such problems. I guess I simply have to find out how to downgrade InfluxDB and then wait until the bug is found and fixed before I upgrade again.
Software that increases memory usage to the point that the underlying system crashes has, in my humble opinion, a bug. And I see little chance of having a bug solved in a community forum.

Some examples...
#10468
=> He deleted the data... well, not really a solution, and a second person confirmed he has the same issue
https://community.influxdata.com/t/memory-leak-in-influxdb-1-7-4/8889
=> also a VM...
https://community.influxdata.com/t/memory-increase-slowly-over-17-hours-until-oom-killed-it/3893
...
Some could think... yeah, not the latest version... but I have the latest and still have the issue. :(

@positron96

@lobocobra
There are some instructions for downgrading from 1.7, and it doesn't look very difficult: https://docs.influxdata.com/influxdb/v1.7/administration/upgrading/; it just tells you to delete the tsi1 indexes and turn the inmem index back on in the config.
I'm also somewhat struggling with increased memory consumption and disk I/O spikes on my pet InfluxDB (upgraded from 1.6 to 1.7.9 on a 512 MB cloud server); I will try downgrading soon if it keeps happening.
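
Per that doc, the rollback is roughly: stop influxd, switch the index back to inmem, remove the on-disk tsi1 index directories, and restart. A minimal sketch, assuming the default paths:

sudo systemctl stop influxdb
# in /etc/influxdb/influxdb.conf, under [data], set:
#   index-version = "inmem"
# remove the tsi1 index directories (the inmem index does not use them):
find /var/lib/influxdb/data -type d -name index -exec rm -rf {} +
sudo systemctl start influxdb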

@Doc999tor

@lobocobra @positron96
Any updates regarding the downgrade?
I'm on 1.7.4 and hit 800 MB for InfluxDB; considering trying the downgrade as well.

@positron96

positron96 commented Dec 30, 2019

> @lobocobra @positron96
> Any updates regarding the downgrade?
> I'm on 1.7.4 and hit 800 MB for InfluxDB; considering trying the downgrade as well.

Well, I downgraded to 1.7.1 and disabled TSI; it already helped remove the disk usage spikes, and RAM consumption is some 50-80 MB lower (quite a lot on a 512 MB server).
I might downgrade further to 1.6, but not in the near future, since my immediate problems are currently solved.

Here is the transition from 1.7.9 to 1.7.1 with TSI disabled:
[screenshot: memory usage across the downgrade]

@lobocobra

lobocobra commented Jan 11, 2020

@positron96
Many thanks for your feedback. After my server froze again, I am now attempting to go back to version 1.6.4.
I am surprised that this bug is not fixed. How can a package regularly crash the underlying server without this being fixed? I would expect the software to recognize such a situation before it happens and either shut down or restart.

Here are the steps I took to downgrade; they might help others:
wget https://dl.influxdata.com/influxdb/releases/influxdb_1.6.4_amd64.deb
sudo dpkg -i influxdb_1.6.4_amd64.deb

Here is how to re-index (only needed for tsi1):
https://community.hiveeyes.org/t/repair-influxdb-tsi-index-files/1107

Memory usage went down from 1.1 GB to 340 MB.

@horsto

horsto commented Feb 27, 2023

Is this still an issue for people? My InfluxDB memory usage increases steadily over roughly 72 hours until it basically incapacitates the server it is running on. I wonder if there is a more recent fix for this?
