Data Transfers are RAM hungry #4877

Closed
dineshshenoy opened this issue Nov 16, 2020 · 16 comments
Assignees
dirkmc

Labels
area/markets/client, area/markets/storage, P2 (P2: Should be resolved), team/ignite (Issues and PRs being tracked by Team Ignite at Protocol Labs)

Comments

@dineshshenoy

dineshshenoy commented Nov 16, 2020

The memory usage when creating client deals is very high. @jsign created 5 deals (40 GB total), which increased RAM usage by 15 GB until the transfers finished. Lotus's total peak RAM usage was 40 GB for these 5 concurrent-ish deals.

cc: @hannahhoward

@Stebalien
Member

Which version?

@dineshshenoy
Author

@jsign do you know what version this is happening on? I think this was 1.1.3, but maybe a different commit.

@jsign
Contributor

jsign commented Nov 21, 2020

This was with official v1.1.3.

@dineshshenoy dineshshenoy added this to the Reduce Resources For Storage Deals milestone Nov 23, 2020
@jsign
Contributor

jsign commented Nov 25, 2020

I had the opportunity to get into this situation again:

Daemon:  1.2.1+git.97a76a4b9+api1.0.0
Local: lotus version 1.2.1+git.97a76a4b9

(Which is this PR, so some master-ish)

Now there are 4 outgoing data transfers of 4 GiB each to miners.
[screenshot: memory and swap usage while the transfers are running]
I took multiple screenshots of the above, so I missed the one with full RAM... but you can imagine that full RAM pushed 10 GiB into swap.

Here are some pprof profiles of the heap and the running goroutines:
pprofs.zip

cc @hannahhoward @Stebalien @dirkmc @dineshshenoy

@kernelogic

So is this proportional to the number of deals or the size of the deals?

@jsign
Contributor

jsign commented Nov 27, 2020

Another profile, with full RAM and 21 GiB of swap:
pprof20201127.zip

@Stebalien
Member

graphsync 0.5.1 (the version you're using) should max out at 6 requests and 256 MiB of memory, so something is definitely funny here. In theory it could be that GC is falling behind (especially if you're using swap), but really, if you're just sending a few deals as a client, we should never get to that point.
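To make that expectation concrete, here is a minimal sketch of the idea (this is not graphsync's actual code or API, just an illustration of a fixed request-slot pool plus a bounded buffer budget):

package main

import "fmt"

func main() {
    // At most 6 requests run at once, no matter how many transfers were
    // started. The buffered channel acts as the pool of request slots.
    slots := make(chan struct{}, 6)
    done := make(chan int)

    for i := 0; i < 20; i++ { // e.g. 20 queued transfers
        go func(id int) {
            slots <- struct{}{}        // wait for a free request slot
            defer func() { <-slots }() // return it when the transfer ends
            // ... perform the transfer here, keeping buffered block data
            // within a fixed shared budget (e.g. ~256 MiB) rather than
            // proportional to the total deal size ...
            done <- id
        }(i)
    }

    for i := 0; i < 20; i++ {
        fmt.Println("transfer finished:", <-done)
    }
}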

@dineshshenoy
Author

The max of 6 requests was relaxed recently, which may be why this issue came up. cc: @dirkmc

@Stebalien
Member

That shouldn't matter too much (I only bring it up because there may be some per-request overhead or over-allocation, not sure). However, the version @jsign is using has that restriction.

@hannahhoward
Contributor

Goodness, I wish Go had better reference-tracking diagnostics for catching memory leaks. It will be interesting to see results with the memory watchdog, which can help us rule out GC as a cause. We identified a particular memory leak -- but there may be others. We also probably need to look at the IPFS blockstore code, because it may be holding a reference in a way it shouldn't.

@raulk
Member

raulk commented Dec 3, 2020

Memory watchdog is now merged. You can set a maximum heap limit through the LOTUS_MAX_HEAP env variable. Docs here: filecoin-project/filecoin-docs#609

Could you also follow the steps here? #4445 (comment)

And also add gctrace=1 to that GODEBUG env variable (comma-separated).
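For example, if your GODEBUG already contains madvdontneed=1 (the setting used later in this thread), the combination would look something like the lines below; the heap limit value is purely illustrative:

LOTUS_MAX_HEAP=32GiB
GODEBUG=madvdontneed=1,gctrace=1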

@dineshshenoy dineshshenoy modified the milestones: Reduce Resources For Storage Deals, 💹Storage Deal Success Dec 3, 2020
@dineshshenoy dineshshenoy assigned dirkmc and unassigned raulk Dec 3, 2020
@dirkmc
Contributor

dirkmc commented Dec 7, 2020

It looks like the problem is somewhere in this code path:
https://github.com/ipfs/go-graphsync/blob/11d30c607e3f9062b1a3755dba34f6abcb9eaf85/responsemanager/runtraversal/runtraversal.go#L21-L56

[screenshot: heap profile pointing at allocations in this code path]

My guess is that either Traverser.Advance(blockBuffer) or sendResponse(lnk, data) is keeping a reference to the data slice that never gets cleaned up.

I dug around in the code but I wasn't able to figure out exactly where that might be happening. @hannahhoward do you have an idea where it might occur?
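To illustrate the kind of retention being suspected here (a simplified sketch, not the actual go-graphsync code), the problem shape would be an outgoing queue that keeps every block's data slice alive until the network drains it, with nothing pushing back on the traversal:

package main

type queuedBlock struct {
    link string
    data []byte // the whole block slice stays reachable from here
}

type responseQueue struct {
    pending []queuedBlock // unbounded: nothing limits how much piles up
}

// sendResponse hands the block to the queue; if the network is slower than
// the traversal, `pending` (and therefore the heap) grows with the file size.
func (q *responseQueue) sendResponse(link string, data []byte) {
    q.pending = append(q.pending, queuedBlock{link: link, data: data})
}

// runTraversal reads block after block and queues each one immediately,
// with no budget check and no backpressure.
func runTraversal(q *responseQueue, next func() (string, []byte, bool)) {
    for {
        link, data, ok := next()
        if !ok {
            return
        }
        q.sendResponse(link, data)
    }
}

func main() {
    q := &responseQueue{}
    i := 0
    next := func() (string, []byte, bool) {
        if i == 64 { // 64 blocks of 1 MiB; scale this up and RAM scales with it
            return "", nil, false
        }
        i++
        return "blk", make([]byte, 1<<20), true
    }
    runTraversal(q, next)
    // q.pending now holds ~64 MiB of live data; for a 16 GiB deal the same
    // pattern would hold ~16 GiB until the queue is drained.
}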

@jsign
Contributor

jsign commented Dec 21, 2020

Leaving here the story of something that happened while upgrading the Lotus node we use for deal-success tests.

This node had ~8 in-progress 16 GiB (padded) deals, many of them still pending the data-transfer stage (so, a lot of data still waiting to be transferred). When I updated to v1.4.0, I saw the node struggling a lot with syncing.

Further inspection showed it was using a lot of RAM, better described by:

 |        Memory         |    SWAP
 | used  free  buff  cach| used  free>
 |30.7G  273M 2836k  260M|  24G   26G>

So ~54 GiB used in total between RAM and swap. I was pretty sure this was related to data transfers.

I restarted with:

LOTUS_MAX_HEAP=30GiB
GODEBUG=madvdontneed=1

so as to be quite aggressive in ruling out possible causes in the Go GC implementation.
After restarting, the same RAM behavior happened... so it sounded like the in_use memory reported by the Go runtime must itself be high.

I took a profile, and it further confirms a few things:
[screenshot: heap profile showing ~54 GiB of in_use space]

  • in_use space is ~54 GiB, which explains why the watchdog's max heap limit and forcing freed memory pages back to the kernel didn't help.
  • it confirms again the memory-reference path being investigated.

The only way I could get out of this situation was by setting SimultaneousTransfers = 1, since I was afraid of running out of swap and being OOM-killed, risking data corruption. This avoided the memory ramp-up, so it all makes sense.

@kernelogic

Where do you set the SimultaneousTransfers variable? If you limit it to 1 and that deal gets stuck, are you stuck forever?

@jsign
Contributor

jsign commented Dec 21, 2020

Where do you set the SimultaneousTransfers variable?

In the [Client] section.

If you limit it to 1 and that deal gets stuck, are you stuck forever?

AFAIK, yes, so it's mostly a workaround.
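For reference, a minimal sketch of what that looks like in the Lotus node's config.toml (the exact file location depends on your LOTUS_PATH; the value 1 is the workaround described above, not a general recommendation):

[Client]
  SimultaneousTransfers = 1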

@raulk
Member

raulk commented Mar 11, 2021

go-graphsync v0.6.0 fixed the memory scaling issues with graphsync transfers by moving memory allocation to the message queue. It was first integrated in Lotus v1.5.1. All our reports indicate that data transfer no longer allocates memory linearly with the file size.
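For intuition, here is a rough sketch of that shape (not the actual go-graphsync v0.6.0 code): the message queue owns a fixed byte budget, and queueing a block blocks the caller until earlier messages have been flushed, so the traversal is paced by the network instead of racing ahead of it.

package main

import "sync"

type messageQueue struct {
    mu     sync.Mutex
    cond   *sync.Cond
    budget int // bytes the queue may still buffer
    queued [][]byte
}

func newMessageQueue(maxBytes int) *messageQueue {
    q := &messageQueue{budget: maxBytes}
    q.cond = sync.NewCond(&q.mu)
    return q
}

// Enqueue blocks the caller (the traversal) until the queue has budget.
func (q *messageQueue) Enqueue(data []byte) {
    q.mu.Lock()
    defer q.mu.Unlock()
    for q.budget < len(data) {
        q.cond.Wait() // backpressure: the traversal stalls here
    }
    q.budget -= len(data)
    q.queued = append(q.queued, data)
    q.cond.Broadcast() // wake the sender if it was waiting for work
}

// run simulates the network sender draining the queue and returning budget.
func (q *messageQueue) run(sent chan<- int) {
    for {
        q.mu.Lock()
        for len(q.queued) == 0 {
            q.cond.Wait()
        }
        msg := q.queued[0]
        q.queued = q.queued[1:]
        q.mu.Unlock()

        sent <- len(msg) // stand-in for writing the message to the wire

        q.mu.Lock()
        q.budget += len(msg) // the bytes are gone; hand the budget back
        q.cond.Broadcast()   // wake any traversal blocked in Enqueue
        q.mu.Unlock()
    }
}

func main() {
    q := newMessageQueue(4 << 20) // e.g. a 4 MiB in-flight budget
    sent := make(chan int)
    go q.run(sent)

    go func() { // the "traversal": wants to push 64 MiB of blocks
        for i := 0; i < 64; i++ {
            q.Enqueue(make([]byte, 1<<20))
        }
    }()

    total := 0
    for total < 64<<20 { // drain until everything has been "sent"
        total += <-sent
    }
}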

Please open a new issue if you find other problems with memory footprint during data transfer.

@raulk raulk closed this as completed Mar 11, 2021