Memory overconsuming #4445

Closed
ArseniiPetrovich opened this issue Oct 16, 2020 · 4 comments

Comments

@ArseniiPetrovich
Contributor

Describe the bug
We are Protofire, and we have been hosting managed Lotus nodes for several months. One of our nodes is used internally for creating snapshots and never serves any RPC requests. Sometimes this node grabs a massive amount of memory and does not return it to the OS. A restart fixes this, but it is not a great workaround. Is there a way to determine why a running node is holding memory? Thank you!

To Reproduce
Steps to reproduce the behavior:

  1. Run lotus daemon
  2. Run lotus chain export --tipset @$( lotus chain list --count 50 --format "<height>" | head -n1 ) --recent-stateroots 900 --skip-old-msgs /data/ipfs/lotus-hot.car on an hourly basis
  3. See error

Expected behavior
The snapshot used to be created in 7 minutes or less; now it takes much longer to complete.

Version (run lotus version):
10.2

@astudnev

We have the same issue: Lotus consumes 96 GB of memory, and a restart does not fix it; it consumes that much again.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
92 studnev 20 0 0.984t 0.097t 0.057t S 92.7 78.5 6085:48 lotus

@raulk
Member

raulk commented Dec 3, 2020

Hi @ArseniiPetrovich @astudnev, what you are likely seeing is a combination of:

  1. a Badger compaction surge, which allocates a lot of memory very quickly.
  2. the Go GC being slow to trigger (this is remedied in "implement a memory watchdog" #5058).
  3. the Go runtime using madv_free instead of madv_dontneed to return memory to the kernel.
  4. your host not being under memory pressure from any other process.

Since go1.12, the Go runtime has used the madv_free flag in madvise. This tells the kernel "hey, these memory pages are free; I might need to use them again soon, but you can take them if you need them". The kernel will keep them mapped to the process until another process exerts memory pressure. If Lotus is the only memory-consuming process running on that host, the memory will not be unmapped and it will give the impression that Lotus is consuming an ever-increasing amount of memory.

These memory pages are effectively free, but they are not accounted for as such by most tooling, including the most popular cgroups stats. We suspect that might also make the OOM killer kick in when it shouldn't. Someone blogged about this: https://www.bwplotka.dev/2019/golang-memory-monitoring/
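
To see that discrepancy for yourself, here is a minimal, self-contained sketch (my own illustration, not Lotus code; the rssBytes helper is hypothetical) that compares the Go runtime's own accounting against the RSS the kernel reports. Under madv_free, HeapReleased grows once the memory is given back, while the kernel RSS (what top shows as RES) barely moves until there is memory pressure.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
	"strconv"
	"strings"
)

// rssBytes reads this process's resident set size from /proc/self/statm
// (Linux only). The second field is the number of resident pages.
func rssBytes() uint64 {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(data))
	if len(fields) < 2 {
		return 0
	}
	pages, _ := strconv.ParseUint(fields[1], 10, 64)
	return pages * uint64(os.Getpagesize())
}

func main() {
	// Allocate roughly 1 GiB, then drop it and ask the runtime to release it.
	buf := make([][]byte, 1024)
	for i := range buf {
		buf[i] = make([]byte, 1<<20)
	}
	buf = nil
	debug.FreeOSMemory()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// With madv_free (the Linux default in go1.12-go1.15), HeapReleased counts
	// pages the runtime has handed back, but the kernel keeps them in RSS
	// until another process needs the memory, so RES in top stays high.
	fmt.Printf("HeapSys:      %d MiB\n", m.HeapSys>>20)
	fmt.Printf("HeapReleased: %d MiB\n", m.HeapReleased>>20)
	fmt.Printf("Kernel RSS:   %d MiB\n", rssBytes()>>20)
}
```

Running this on go1.15 on Linux, and then again with GODEBUG=madvdontneed=1, should show the kernel RSS number dropping only in the second run.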

All of this was problematic and caused quite a bit of misunderstanding in the community. Take a look at this list of related golang/go issues.

For that reason, as of go1.16, the Go runtime will default to using madv_dontneed again. As a result, released memory will become visible again immediately.

Since you're likely building with go1.15.5, can you try restarting two Lotus instances at the same time (ideally with similar repo sizes), one of them with the following env variable, which unlocks this behaviour manually?

GODEBUG="madvdontneed=1"

If you can report back and ideally post some comparative charts, it would be very welcome.
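
If it helps with producing those charts, below is a rough standalone sampler I sketched (not a Lotus tool; names like rssKB are my own) that polls VmRSS from /proc/<pid>/status for one or more PIDs once a minute and prints CSV you can graph.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssKB returns the VmRSS value (in kB) for the given PID by parsing
// /proc/<pid>/status (Linux only). Returns "0" if it cannot be read.
func rssKB(pid string) string {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "0"
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			v := strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
			return strings.TrimSuffix(v, " kB")
		}
	}
	return "0"
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: rss-sampler <pid> [pid...]")
		os.Exit(1)
	}
	// CSV header: one RSS column per PID.
	fmt.Println("timestamp," + strings.Join(os.Args[1:], "_rss_kb,") + "_rss_kb")
	for {
		row := []string{time.Now().Format(time.RFC3339)}
		for _, pid := range os.Args[1:] {
			row = append(row, rssKB(pid))
		}
		fmt.Println(strings.Join(row, ","))
		time.Sleep(time.Minute)
	}
}
```

Run it with the PIDs of the two daemons (the one with GODEBUG set and the one without) and plot the columns side by side.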

@rjan90
Contributor

rjan90 commented Jul 31, 2021

I think this issue can be closed now, since the issue with data-transfers being RAM-hungry is fixed! #rengjøring

@jennijuju
Member

Please open a new ticket if you are running into this issue on the latest version of lotus!
