Investigate CFE memory consumption during Keygen #1646
Comments
I think once we have a working 16GB testnet, we'll want to write some tools to help analyse memory usage (not sure what those would look like, though).
My first observation is that one of the futures (guessing it's the multisig calculations) seems to be starving all the other futures for about a minute. This can happen when there is a long-running task that doesn't yield (i.e. doesn't contain any awaits). Here you can see the periodic (every 10 seconds) ceremony timeout check/cleanup running less frequently around the memory spikes, indicating it is being starved of CPU time. I found this behaviour on all the nodes I checked. To be clear, this is not going to be the cause of the memory usage, but we may wish to move to multiple threads, or run the multisig code on its own thread (see the sketch below). See the 40-second gap in the logs:
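One way to stop a CPU-bound stage from starving the rest of the futures is to hand it to a dedicated blocking thread. Here is a minimal sketch, assuming the CFE runs on a tokio runtime; the types and the process_keygen_stage function below are illustrative placeholders, not the actual multisig code:

```rust
use tokio::task;

// Placeholder types standing in for the real multisig stage data/outcome.
struct StageData(Vec<u8>);
struct StageOutcome(u64);

// Stand-in for the CPU-heavy keygen/secret-share stage computation.
fn process_keygen_stage(data: StageData) -> StageOutcome {
    StageOutcome(data.0.iter().map(|b| *b as u64).sum())
}

async fn handle_stage(data: StageData) -> StageOutcome {
    // spawn_blocking moves the closure onto tokio's blocking thread pool,
    // so the long-running computation no longer monopolises the executor
    // thread that also drives the periodic ceremony-timeout check.
    task::spawn_blocking(move || process_keygen_stage(data))
        .await
        .expect("keygen stage task panicked")
}

#[tokio::main]
async fn main() {
    let outcome = handle_stage(StageData(vec![1, 2, 3])).await;
    println!("stage outcome: {}", outcome.0);
}
```

Running the whole multisig client on its own OS thread (or its own runtime) would achieve the same isolation; spawn_blocking is just the smallest change.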
It also appears never to be running the code that removes expired ceremonies (again, I checked this on several nodes); this could just mean the ceremonies complete normally. Update: both keygens are reported as completing successfully, so the above is normal, although it is interesting that both happen at the same time (both report_keygen_outcomes).
Just for the record (of course we know this already), the memory usage is wildly different across nodes. Non-genesis nodes (key: red is keygen-memory-validator-4); genesis nodes (I assume there were 3 genesis nodes? @morelazers) (key: green is keygen-memory-validator-0).
For keygen-memory-validator-1 above: the first spike is around the secret share stage (16:48) of ceremony_id 1, but oddly, when ceremony id 5 reaches the same point (17:05) the memory doesn't spike at all, despite receiving messages and processing the stage. Ceremony id 1 transitions: Ceremony id 5 transitions: Note it is very strange that nodes (0 and 1) are also taking part in the ceremony (and transition stages) but don't have high memory spikes (see the charts above).
Yes there were three genesis nodes. |
The increases in memory usage do seem to occur when receiving p2p messages (for stages 4/5; note they are called stages 2/3 in this branch), although some nodes receive messages but don't significantly increase memory usage. The drops in memory usage seem to occur 10-15 seconds after a keygen ceremony completes, with the log message "Ceremony reached the final stage!".
I'm not totally confident the timing of the memory usage logs isn't off a little; it certainly seems to be slow by 10-20 seconds. For example, here: above there is a gap where the memory usage stops increasing for a bit (between 17:05:15 -> 17:05:30). The increase in memory usage correlates with the receiving of p2p messages, but the gap in the p2p messages is a little later in time:
The scraping of the metrics happens every 30 seconds.
The high memory usage that stays around above is associated with ceremonies that have not yet finished (which you can see in the logs above).
Also note that the very-low-memory-usage nodes (the brown ones at the bottom of the graph) are part of a different testnet.
Also, the resident memory usage is definitely meaningless and confusing here (virtual is much more useful).
This increase lines up with a failed "not contract compatible" keygen, but the memory usage never goes down. It is possible some buffers are allocated that are only used on the failed path (this is the first keygen failure), though I'm not sure what those would be.
I don't think there is any extra allocation on the failed path. Most likely the memory usage is for the preceding stage 4 messages. But yeah, it's not clear why it stays at the same level afterwards.
In 150-node testnets, during keygen ceremonies, some nodes' CFEs are killed by the OOM killer because the system runs out of memory. This behaviour has been reproducible.
The increase in memory usage occurs only after starting the keygen ceremony, so it seems likely the increase is due to the total message size, which is cubic in the number of nodes. But this doesn't seem to account for all of the increase.
The next thing to do will be to run the nodes with 16GB of RAM per node (instead of the previous 8GB), so we can actually see what the peak memory usage is during a keygen ceremony.
Related issues:
#1648
#1541
#1540
#1517
#732
"Here is how I compute the ... (total size of all messages) ... (in bytes) for the largest stage: (N * 0.67 * 88 + 144) * N * N
This gives me ~200mb for 150 nodes, ~60mb for 100 nodes and ~7mb for 50 nodes."
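For reference, a small sketch that reproduces the estimate quoted above; the formula is taken verbatim from the quote, and only the helper names are made up:

```rust
// Back-of-the-envelope estimate of the total message size (in bytes) for the
// largest keygen stage, as quoted above: (N * 0.67 * 88 + 144) * N * N.
fn estimated_stage_bytes(n: u64) -> f64 {
    let per_message = n as f64 * 0.67 * 88.0 + 144.0;
    per_message * n as f64 * n as f64
}

fn main() {
    for n in [50u64, 100, 150] {
        println!("{} nodes: ~{:.1} MB", n, estimated_stage_bytes(n) / 1_000_000.0);
    }
    // Prints roughly 7.7 MB, 60.4 MB and 202.2 MB, matching the
    // ~7mb / ~60mb / ~200mb figures quoted above.
}
```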