Memory overconsuming #4445

Closed
ArseniiPetrovich opened this issue Oct 16, 2020 · 4 comments

Comments

@ArseniiPetrovich
Contributor

Describe the bug
We are Protofire, and we have been hosting managed Lotus nodes for several months. One of our nodes is used internally for creating snapshots and never serves any RPC requests. Sometimes this node grabs a massive amount of memory and does not return it to the OS. A restart fixes this, but it is not a great workaround. Is there a way to determine why a running node is holding memory? Thank you!

To Reproduce
Steps to reproduce the behavior:

  1. Run lotus daemon
  2. Run lotus chain export --tipset @$( lotus chain list --count 50 --format "<height>" | head -n1 ) --recent-stateroots 900 --skip-old-msgs /data/ipfs/lotus-hot.car on an hourly basis
  3. See error

Expected behavior
The snapshot used to be created in 7 minutes or less; now it takes much longer to complete.

Version (run lotus version):
10.2

@astudnev

We have the same issue: Lotus consumes 96 GB of memory, and a restart does not fix it; it consumes that much again.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
92 studnev 20 0 0.984t 0.097t 0.057t S 92.7 78.5 6085:48 lotus

@raulk
Member

raulk commented Dec 3, 2020

Hi @ArseniiPetrovich @astudnev, what you are likely seeing is a combination of:

  1. a Badger compaction surge, which allocates a lot of memory very quickly.
  2. the Go GC being slow to trigger (this is remedied in "implement a memory watchdog" #5058).
  3. the Go runtime using madv_free instead of madv_dontneed to return memory to the kernel.
  4. your host not being under memory pressure from any other process.

Since go1.12, the Go runtime has used the madv_free flag in madvise. This tells the kernel "hey, these memory pages are free; I might need to use them again soon, but you can take them if you need them". The kernel will keep them mapped to the process until another process exerts memory pressure. If Lotus is the only memory-consuming process running on that host, the memory will not be unmapped and it will give the impression that Lotus is consuming an ever-increasing amount of memory.

These memory pages are effectively free, but they are not accounted for as such by most tooling, including the most popular cgroups stats. We suspect that might also make the OOM killer kick in when it shouldn't. Someone blogged about this: https://www.bwplotka.dev/2019/golang-memory-monitoring/
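
To see that discrepancy for yourself, here is a minimal, self-contained sketch (my own illustration, not Lotus code; the rssBytes helper is hypothetical) that compares the Go runtime's own accounting against the RSS the kernel reports. Under madv_free, HeapReleased grows once the memory is given back, while the kernel RSS (what top shows as RES) barely moves until there is memory pressure.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
	"strconv"
	"strings"
)

// rssBytes reads this process's resident set size from /proc/self/statm
// (Linux only). The second field is the number of resident pages.
func rssBytes() uint64 {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(data))
	if len(fields) < 2 {
		return 0
	}
	pages, _ := strconv.ParseUint(fields[1], 10, 64)
	return pages * uint64(os.Getpagesize())
}

func main() {
	// Allocate roughly 1 GiB, then drop it and ask the runtime to release it.
	buf := make([][]byte, 1024)
	for i := range buf {
		buf[i] = make([]byte, 1<<20)
	}
	buf = nil
	debug.FreeOSMemory()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// With madv_free (the Linux default in go1.12-go1.15), HeapReleased counts
	// pages the runtime has handed back, but the kernel keeps them in RSS
	// until another process needs the memory, so RES in top stays high.
	fmt.Printf("HeapSys:      %d MiB\n", m.HeapSys>>20)
	fmt.Printf("HeapReleased: %d MiB\n", m.HeapReleased>>20)
	fmt.Printf("Kernel RSS:   %d MiB\n", rssBytes()>>20)
}
```

Running this on go1.15 on Linux, and then again with GODEBUG=madvdontneed=1, should show the kernel RSS number dropping only in the second run.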

All of this was problematic and caused quite a bit of misunderstanding in the community. Take a look at this list of related golang/go issues.

For that reason, as of go1.16, the Go runtime will default to using madv_dontneed again. As a result, released memory will become visible again immediately.

Since you're likely building with go1.15.5, can you try restarting two Lotus instances at the same time (ideally with similar repo sizes), one of them with the following env variable, which unlocks this behaviour manually?

GODEBUG="madvdontneed=1"

If you can report back and ideally post some comparative charts, it would be very welcome.
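
If it helps with producing those charts, below is a rough standalone sampler I sketched (not a Lotus tool; names like rssKB are my own) that polls VmRSS from /proc/<pid>/status for one or more PIDs once a minute and prints CSV you can graph.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssKB returns the VmRSS value (in kB) for the given PID by parsing
// /proc/<pid>/status (Linux only). Returns "0" if it cannot be read.
func rssKB(pid string) string {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "0"
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			v := strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
			return strings.TrimSuffix(v, " kB")
		}
	}
	return "0"
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: rss-sampler <pid> [pid...]")
		os.Exit(1)
	}
	// CSV header: one RSS column per PID.
	fmt.Println("timestamp," + strings.Join(os.Args[1:], "_rss_kb,") + "_rss_kb")
	for {
		row := []string{time.Now().Format(time.RFC3339)}
		for _, pid := range os.Args[1:] {
			row = append(row, rssKB(pid))
		}
		fmt.Println(strings.Join(row, ","))
		time.Sleep(time.Minute)
	}
}
```

Run it with the PIDs of the two daemons (the one with GODEBUG set and the one without) and plot the columns side by side.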

@rjan90
Contributor

rjan90 commented Jul 31, 2021

I think this issue can be closed now, since the issue with data-transfers being RAM-hungry is fixed! #rengjøring

@jennijuju
Member

Please open a new ticket if you are running into this issue on the latest version of lotus!
