Data Transfers are RAM hungry #4877

Closed
dineshshenoy opened this issue Nov 16, 2020 · 16 comments
Assignees
dirkmc

Labels
area/markets/client, area/markets/storage, P2 (P2: Should be resolved), team/ignite (Issues and PRs being tracked by Team Ignite at Protocol Labs)

Comments

@dineshshenoy

dineshshenoy commented Nov 16, 2020

The memory usage when creating client deals is very high. @jsign created 5 deals (40 GB total), which increased RAM usage by 15 GB until the transfers finished. Lotus's total peak RAM usage was 40 GB for these 5 concurrent-ish deals.

cc: @hannahhoward

@Stebalien
Member

Which version?

@dineshshenoy
Author

@jsign do you know what version this is happening on? I think this was 1.1.3, but maybe a different commit.

@jsign
Contributor

jsign commented Nov 21, 2020

This was with official v1.1.3.

@dineshshenoy dineshshenoy added this to the Reduce Resources For Storage Deals milestone Nov 23, 2020
@jsign
Contributor

jsign commented Nov 25, 2020

I had the opportunity to get into this situation again:

Daemon:  1.2.1+git.97a76a4b9+api1.0.0
Local: lotus version 1.2.1+git.97a76a4b9

(Which is this PR, so some master-ish)

Now there are 4 outgoing data transfers of 4 GiB each to miners.
[screenshot: memory and swap usage while the transfers are running]
I took multiple screenshots of the above, so I missed the one with full RAM... but you can imagine that full RAM pushed 10 GiB into swap.

Here are some pprof profiles of the heap and the running goroutines:
pprofs.zip

cc @hannahhoward @Stebalien @dirkmc @dineshshenoy

@kernelogic

So is this proportional to the number of deals or the size of the deals?

@jsign
Contributor

jsign commented Nov 27, 2020

Another profile, with full RAM and 21 GiB of swap:
pprof20201127.zip

@Stebalien
Member

graphsync 0.5.1 (the version you're using) should max out at 6 requests and 256 MiB of memory, so something is definitely funny here. In theory it could be that GC is falling behind (especially if you're using swap), but really, if you're just sending a few deals as a client, we should never get to that point.
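To make that expectation concrete, here is a minimal sketch of the idea (this is not graphsync's actual code or API, just an illustration of a fixed request-slot pool plus a bounded buffer budget):

package main

import "fmt"

func main() {
    // At most 6 requests run at once, no matter how many transfers were
    // started. The buffered channel acts as the pool of request slots.
    slots := make(chan struct{}, 6)
    done := make(chan int)

    for i := 0; i < 20; i++ { // e.g. 20 queued transfers
        go func(id int) {
            slots <- struct{}{}        // wait for a free request slot
            defer func() { <-slots }() // return it when the transfer ends
            // ... perform the transfer here, keeping buffered block data
            // within a fixed shared budget (e.g. ~256 MiB) rather than
            // proportional to the total deal size ...
            done <- id
        }(i)
    }

    for i := 0; i < 20; i++ {
        fmt.Println("transfer finished:", <-done)
    }
}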

@dineshshenoy
Author

The max of 6 requests was relaxed recently, which may be why this issue came up. cc: @dirkmc

@Stebalien
Member

That shouldn't matter too much (I only bring it up because there may be some per-request overhead or over-allocation, not sure). However, the version @jsign is using has that restriction.

@hannahhoward
Contributor

Goodness, I wish Go had better reference-tracking diagnostics for catching memory leaks. It will be interesting to see results with the memory watchdog, which can help us rule out GC as a cause. We identified a particular memory leak -- but there may be others. We also probably need to look at the IPFS blockstore code, because it may be holding a reference in a way it shouldn't.

@raulk
Member

raulk commented Dec 3, 2020

Memory watchdog is now merged. You can set a maximum heap limit through the LOTUS_MAX_HEAP env variable. Docs here: filecoin-project/filecoin-docs#609

Could you also follow the steps here? #4445 (comment)

And also add gctrace=1 to that GODEBUG env variable (comma-separated).
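For example, if your GODEBUG already contains madvdontneed=1 (the setting used later in this thread), the combination would look something like the lines below; the heap limit value is purely illustrative:

LOTUS_MAX_HEAP=32GiB
GODEBUG=madvdontneed=1,gctrace=1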

@dineshshenoy dineshshenoy modified the milestones: Reduce Resources For Storage Deals, 💹Storage Deal Success Dec 3, 2020
@dineshshenoy dineshshenoy assigned dirkmc and unassigned raulk Dec 3, 2020
@dirkmc
Contributor

dirkmc commented Dec 7, 2020

It looks like the problem is somewhere in this code path:
https://github.com/ipfs/go-graphsync/blob/11d30c607e3f9062b1a3755dba34f6abcb9eaf85/responsemanager/runtraversal/runtraversal.go#L21-L56

[screenshot: heap profile pointing at allocations in this code path]

My guess is that either Traverser.Advance(blockBuffer) or sendResponse(lnk, data) is keeping a reference to the data slice that never gets cleaned up.

I dug around in the code but I wasn't able to figure out exactly where that might be happening. @hannahhoward do you have an idea where it might occur?
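To illustrate the kind of retention being suspected here (a simplified sketch, not the actual go-graphsync code), the problem shape would be an outgoing queue that keeps every block's data slice alive until the network drains it, with nothing pushing back on the traversal:

package main

type queuedBlock struct {
    link string
    data []byte // the whole block slice stays reachable from here
}

type responseQueue struct {
    pending []queuedBlock // unbounded: nothing limits how much piles up
}

// sendResponse hands the block to the queue; if the network is slower than
// the traversal, `pending` (and therefore the heap) grows with the file size.
func (q *responseQueue) sendResponse(link string, data []byte) {
    q.pending = append(q.pending, queuedBlock{link: link, data: data})
}

// runTraversal reads block after block and queues each one immediately,
// with no budget check and no backpressure.
func runTraversal(q *responseQueue, next func() (string, []byte, bool)) {
    for {
        link, data, ok := next()
        if !ok {
            return
        }
        q.sendResponse(link, data)
    }
}

func main() {
    q := &responseQueue{}
    i := 0
    next := func() (string, []byte, bool) {
        if i == 64 { // 64 blocks of 1 MiB; scale this up and RAM scales with it
            return "", nil, false
        }
        i++
        return "blk", make([]byte, 1<<20), true
    }
    runTraversal(q, next)
    // q.pending now holds ~64 MiB of live data; for a 16 GiB deal the same
    // pattern would hold ~16 GiB until the queue is drained.
}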

@jsign
Contributor

jsign commented Dec 21, 2020

Leaving here the story of something that happened while upgrading the Lotus node we use for deal-success tests.

This node had ~8 in-progress 16 GiB (padded) deals, many of them still pending the data-transfer stage (so, a lot of data still waiting to be transferred). When I updated to v1.4.0, I saw the node struggling a lot with syncing.

Further inspection showed it was using a lot of RAM, better described by:

 |        Memory         |    SWAP
 | used  free  buff  cach| used  free>
 |30.7G  273M 2836k  260M|  24G   26G>

So ~54 GiB used in total between RAM and swap. I was pretty sure this was related to data transfers.

I restarted with:

LOTUS_MAX_HEAP=30GiB
GODEBUG=madvdontneed=1

so as to be quite aggressive in ruling out possible causes in the Go GC implementation.
After restarting, the same RAM behavior happened... so it sounded like the in_use memory reported by the Go runtime must itself be high.

I took a profile, and it further confirms a few things:
[screenshot: heap profile showing ~54 GiB of in_use space]

  • in_use space is ~54 GiB, which explains why the watchdog's max heap limit and forcing freed memory pages back to the kernel didn't help.
  • it confirms again the memory-reference path being investigated.

The only way I could get out of this situation was by setting SimultaneousTransfers = 1, since I was afraid of running out of swap and being OOM-killed, risking data corruption. This avoided the memory ramp-up, so it all makes sense.

@kernelogic

Where do you set the SimultaneousTransfers variable? If you limit it to 1 and that deal gets stuck, are you stuck forever?

@jsign
Contributor

jsign commented Dec 21, 2020

Where do you set the SimultaneousTransfers variable?

In the [Client] section.

If you limit it to 1 and that deal gets stuck, are you stuck forever?

AFAIK, yes, so it's mostly a workaround.
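For reference, a minimal sketch of what that looks like in the Lotus node's config.toml (the exact file location depends on your LOTUS_PATH; the value 1 is the workaround described above, not a general recommendation):

[Client]
  SimultaneousTransfers = 1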

@raulk
Member

raulk commented Mar 11, 2021

go-graphsync v0.6.0 fixed the memory scaling issues with graphsync transfers by moving memory allocation to the message queue. It was first integrated in Lotus v1.5.1. All our reports indicate that data transfer no longer allocates memory linearly with the file size.
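For intuition, here is a rough sketch of that shape (not the actual go-graphsync v0.6.0 code): the message queue owns a fixed byte budget, and queueing a block blocks the caller until earlier messages have been flushed, so the traversal is paced by the network instead of racing ahead of it.

package main

import "sync"

type messageQueue struct {
    mu     sync.Mutex
    cond   *sync.Cond
    budget int // bytes the queue may still buffer
    queued [][]byte
}

func newMessageQueue(maxBytes int) *messageQueue {
    q := &messageQueue{budget: maxBytes}
    q.cond = sync.NewCond(&q.mu)
    return q
}

// Enqueue blocks the caller (the traversal) until the queue has budget.
func (q *messageQueue) Enqueue(data []byte) {
    q.mu.Lock()
    defer q.mu.Unlock()
    for q.budget < len(data) {
        q.cond.Wait() // backpressure: the traversal stalls here
    }
    q.budget -= len(data)
    q.queued = append(q.queued, data)
    q.cond.Broadcast() // wake the sender if it was waiting for work
}

// run simulates the network sender draining the queue and returning budget.
func (q *messageQueue) run(sent chan<- int) {
    for {
        q.mu.Lock()
        for len(q.queued) == 0 {
            q.cond.Wait()
        }
        msg := q.queued[0]
        q.queued = q.queued[1:]
        q.mu.Unlock()

        sent <- len(msg) // stand-in for writing the message to the wire

        q.mu.Lock()
        q.budget += len(msg) // the bytes are gone; hand the budget back
        q.cond.Broadcast()   // wake any traversal blocked in Enqueue
        q.mu.Unlock()
    }
}

func main() {
    q := newMessageQueue(4 << 20) // e.g. a 4 MiB in-flight budget
    sent := make(chan int)
    go q.run(sent)

    go func() { // the "traversal": wants to push 64 MiB of blocks
        for i := 0; i < 64; i++ {
            q.Enqueue(make([]byte, 1<<20))
        }
    }()

    total := 0
    for total < 64<<20 { // drain until everything has been "sent"
        total += <-sent
    }
}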

Please open a new issue if you find other problems with memory footprint during data transfer.

@raulk raulk closed this as completed Mar 11, 2021