Investigate CFE memory consumption during Keygen #1646

Closed
morelazers opened this issue May 9, 2022 · 16 comments
Assignees
Labels
CFE effort-8 Highly likely to encounter unknown unknowns p1-asap

Comments

@morelazers

morelazers commented May 9, 2022

In 150-node testnets, some nodes' CFEs are killed by the OOM killer during keygen ceremonies because the system runs out of memory. This behaviour has been reproducible.

The increase in memory usage only occurs after the keygen ceremony starts, so it seems likely the increase is due to the total message size, which grows cubically with the number of nodes. But this doesn't seem to account for all of the increase.

The next step will be to run the nodes with 16GB of RAM each (instead of the previous 8GB), so we can actually see what the peak memory usage is during a keygen ceremony.

Related issues:

#1648
#1541
#1540
#1517
#732

"Here is how I compute the ... (total size of all messages) ... (in bytes) for the largest stage: (N * 0.67 * 88 + 144) * N * N
This gives me ~200mb for 150 nodes, ~60mb for 100 nodes and ~7mb for 50 nodes."
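
For reference, a minimal Rust sketch of that estimate (it just restates the quoted formula; the function name is mine, not code from the CFE):

```rust
// Restatement of the quoted estimate: for the largest stage, each of the N
// nodes sends a message of roughly N * 0.67 * 88 + 144 bytes to each of the
// N nodes.
fn estimated_total_message_bytes(n: f64) -> f64 {
    let per_message = n * 0.67 * 88.0 + 144.0; // bytes per message
    per_message * n * n // N senders * N receivers
}

fn main() {
    for n in [50.0_f64, 100.0, 150.0] {
        println!(
            "{} nodes -> ~{:.1} MB total",
            n,
            estimated_total_message_bytes(n) / 1_000_000.0
        );
    }
}
```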

image

@AlastairHolmes AlastairHolmes self-assigned this May 9, 2022
@morelazers morelazers added CFE p1-asap effort-8 Highly likely to encounter unknown unknowns labels May 9, 2022
@nakul-cf nakul-cf changed the title Investigate CFE memory consumption during Keygen [SC-3418] Investigate CFE memory consumption during Keygen May 9, 2022
@morelazers morelazers added this to the 0.3.1 - Airplane Peanuts milestone May 24, 2022
@morelazers morelazers changed the title [SC-3418] Investigate CFE memory consumption during Keygen Investigate CFE memory consumption during Keygen May 24, 2022
@morelazers morelazers self-assigned this Jun 7, 2022
@AlastairHolmes
Contributor

I think once we have a working 16GB testnet, we'll want to write some tools to help analyse memory usage (not sure what those would look like though).

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 8, 2022

My first observation is that one of the futures (I'm guessing it's the multisig calculations) seems to be starving all the other futures for about 1 minute. This can happen if there is a long-running task that doesn't yield (i.e. doesn't have any awaits in it).

Here you can see the periodic (every 10 seconds) ceremony timeout check/cleanup running less frequently around the memory spikes, indicating it is being starved of CPU time. I found this behaviour on all the nodes I checked.

image

To be clear, this is not going to be the cause of the memory issue, but we may wish to move to multiple threads, or run the multisig code on its own thread (a sketch of the latter is below).

See 40 second gap in logs:

image
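
A minimal sketch of that second option, assuming a tokio runtime (the function names here are hypothetical, not the CFE's actual API): the CPU-bound stage processing is handed to `spawn_blocking` so the executor thread stays free to poll the other futures, including the periodic timeout check.

```rust
use tokio::task;

// Hypothetical stand-in for the expensive multisig stage computation.
fn process_keygen_stage(messages: Vec<Vec<u8>>) -> Vec<u8> {
    // ...heavy, non-async cryptographic work...
    messages.concat()
}

async fn handle_stage(messages: Vec<Vec<u8>>) -> Vec<u8> {
    // Run the CPU-bound work on tokio's blocking thread pool instead of
    // inline in the async task, so it cannot starve the other futures.
    task::spawn_blocking(move || process_keygen_stage(messages))
        .await
        .expect("blocking task panicked")
}
```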

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 8, 2022

It never appears to run the code that removes expired ceremonies (again, I checked this on several nodes); this could just mean the ceremonies complete normally:

image

Update: Both keygens are reported as completing successfully, so the above is normal:

image

Although it is interesting that both happen at the same time (both report_keygen_outcomes).

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 8, 2022

Just for the record (of course we know this already), the memory usage differs wildly between nodes:

Non-genesis nodes:

image

Key:

Red: keygen-memory-validator-4
Orange: keygen-memory-validator-39
Blue: keygen-memory-validator-40
Light Blue: keygen-memory-validator-48
Light Red: keygen-memory-validator-69

Genesis nodes (I assume there were 3 genesis nodes? @morelazers):

image

Key:

Green: keygen-memory-validator-0
Yellow: keygen-memory-validator-1
Purple: keygen-memory-validator-2

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 8, 2022

For keygen-memory-validator-1 above:

The first spike is around the secret share stage (16:48) of ceremony_id 1, but oddly, when ceremony_id 5 gets to the same point (17:05) the memory doesn't spike at all (despite receiving messages and processing the stage).

Ceremony id 1 transitions:

image

Ceremony id 5 transitions:

image

Note it is very strange that nodes 0 and 1 also take part in the ceremony (and transition through the stages), but don't show high memory spikes (see the charts above).

@morelazers
Author

Yes there were three genesis nodes.

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 8, 2022

The increases in memory usage do seem to occur when receiving p2p messages for stages 4/5 (note these are called stages 2/3 in this branch), although some nodes receive messages without a significant increase in memory usage.

The drops in memory usage seem to occur 10-15 seconds after a keygen ceremony completes, with the log message "Ceremony reached the final stage!".
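
Purely as an illustration of why completion would release memory (this is a generic pattern, not the CFE's actual code, and all names here are hypothetical): if the buffered stage messages live inside per-ceremony state held in a map, they are only freed once the finished ceremony's entry is removed.

```rust
use std::collections::HashMap;

type CeremonyId = u64;

// Hypothetical per-ceremony state; the buffered stage messages dominate its size.
struct CeremonyState {
    buffered_messages: Vec<Vec<u8>>,
}

#[derive(Default)]
struct CeremonyManager {
    ceremonies: HashMap<CeremonyId, CeremonyState>,
}

impl CeremonyManager {
    // Removing the entry drops CeremonyState, and with it all buffered
    // messages, which is when the allocator actually gets the memory back.
    fn on_ceremony_finished(&mut self, id: CeremonyId) {
        self.ceremonies.remove(&id);
    }
}
```

Depending on the allocator, freed memory may also not be returned to the OS immediately, which could account for part of the delay between completion and the observed drop.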

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 8, 2022

I'm not totally confident that the timing of the memory usage logs isn't off a little; it certainly seems to lag by 10-20 seconds.

For example here:

image

Above there is a gap where the memory usage stops increasing for a bit (between 17:05:15 -> 17:05:30). The increase in memory usage correlates with the receiving of p2p messages, but the gap in the p2p messages is a little later in time:

image

@morelazers
Author

The scraping of the metrics happens every 30 seconds.

@morelazers morelazers removed their assignment Jun 10, 2022
@AlastairHolmes
Contributor

AlastairHolmes commented Jun 21, 2022

Using IPC instead of RPC worked (note that nodes start around 13:55):

image

image

Node 19 ceremonies:

image

image

image
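
For context, this is roughly what the two transport options look like if the node connection in question goes through the web3 crate (an assumption on my part; the endpoints below are placeholders):

```rust
use web3::transports::{Ipc, WebSocket};

// Requires the web3 crate's "ws-tokio" and "ipc-tokio" features.
async fn connect_both() -> web3::Result<()> {
    // JSON-RPC over WebSocket: every request goes through the node's RPC server.
    let ws = WebSocket::new("ws://127.0.0.1:8546").await?;
    let _eth_over_ws = web3::Web3::new(ws);

    // IPC: the same JSON-RPC protocol, but over a local Unix domain socket.
    let ipc = Ipc::new("/data/geth.ipc").await?;
    let _eth_over_ipc = web3::Web3::new(ipc);

    Ok(())
}
```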

@AlastairHolmes
Contributor

The high memory usage that persists above is associated with ceremonies that have not yet finished (which you can see in the logs above).

@AlastairHolmes
Contributor

Also note that the very-low-memory-usage nodes (the brown ones at the bottom of the graph) are part of a different testnet.

@AlastairHolmes
Contributor

Also, the resident memory usage is definitely meaningless and confusing (virtual is much more useful).
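
For reference, a small sketch of where those two numbers come from on Linux, reading /proc/self/status directly (not necessarily how the metrics are collected here):

```rust
use std::fs;

// Returns (virtual, resident) memory in kB, as reported by the kernel.
fn memory_usage_kb() -> Option<(u64, u64)> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let field = |name: &str| {
        status
            .lines()
            .find(|line| line.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse::<u64>()
            .ok()
    };
    Some((field("VmSize:")?, field("VmRSS:")?))
}

fn main() {
    if let Some((virt, rss)) = memory_usage_kb() {
        println!("virtual: {virt} kB, resident: {rss} kB");
    }
}
```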

@AlastairHolmes
Contributor

AlastairHolmes commented Jun 21, 2022

There is still some odd behaviour:

image

@AlastairHolmes
Contributor

This increase lines up with a failed "not contract compatible" keygen, but the memory usage never goes down. It is possible some buffers are allocated that are only used in the failed path (this is the first keygen failure). I'm not sure what those would be, though.

@msgmaxim
Contributor

"It is possible some buffers are allocated that are only used in the failed path"

I don't think there is any extra allocation on the failed path. Most likely the memory usage is for the preceding stage 4 messages. But yeah, it's not clear why it stays at the same level afterwards.
