subsystem-bench: cache misses profiling #2893
Generally looking good to me.
I think we should avoid printing cachegrind output directly to stdout, as it can be confusing. Either print to a file or prepend the valgrind stdout with a header that specifies that valgrind output follows.
@@ -198,6 +216,52 @@ impl BenchCli {
	}
}

#[cfg(target_os = "linux")]
fn is_valgrind_mode() -> bool {
nit: we could move all of these functions into a Linux-only valgrind module for better encapsulation. Also, we could avoid having empty valgrind functions.
Yes, extracting it to a module is a good idea, but how do we avoid the empty functions?
If you add #![cfg(target_os = "linux")] to the top of the valgrind file, it'll only be compiled on Linux. You'd then have to call the valgrind functions only on Linux (add #[cfg()]s to the calling code), and you wouldn't need the empty functions.
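A minimal sketch of that layout, for illustration only. The module is written inline here so the snippet compiles as one file; the `--cache-misses` flag and the `profiling_mode` helper are hypothetical and are not taken from the PR.

```rust
// Inline stand-in for a separate `valgrind.rs` that would start with
// `#![cfg(target_os = "linux")]`; the attribute on `mod` has the same effect.
#[cfg(target_os = "linux")]
mod valgrind {
    /// Hypothetical check: valgrind mode is requested via a CLI flag.
    pub fn is_valgrind_mode(args: &[String]) -> bool {
        args.iter().any(|a| a == "--cache-misses")
    }
}

/// Returns a short description of the selected profiling mode.
fn profiling_mode(args: &[String]) -> &'static str {
    // The call site is gated as well, so no empty non-Linux stub is needed.
    #[cfg(target_os = "linux")]
    {
        if valgrind::is_valgrind_mode(args) {
            return "cachegrind";
        }
    }
    // Keeps `args` used on non-Linux targets, where the block above is removed.
    let _ = args;
    "none"
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    println!("profiling mode: {}", profiling_mode(&args));
}
```

On non-Linux targets the gated block disappears at compile time, so the `valgrind` module is never referenced there.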
LGTM! There are additional options to cache sim which might be useful:
--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache. Only useful with --cache-sim=yes.
--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache. Only useful with --cache-sim=yes.
--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache. Only useful with --cache-sim=yes.
The documentation states that the simulator currently approximates an AMD Athlon CPU circa 2002, which is worse than the reference hardware spec. I think we should tune these values to the reference hardware or the actual host configuration.
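As a rough sketch of how those options could be assembled in the bench CLI: the `CacheGeometry` type and both helper functions are hypothetical, and the numbers in `main` are placeholders rather than the reference hardware values.

```rust
/// Geometry of one simulated cache level, in the order Cachegrind expects.
struct CacheGeometry {
    size_bytes: u64,
    associativity: u32,
    line_size_bytes: u32,
}

/// Formats one cache option such as `--D1=<size>,<associativity>,<line size>`.
fn cache_flag(level: &str, g: &CacheGeometry) -> String {
    format!(
        "--{}={},{},{}",
        level, g.size_bytes, g.associativity, g.line_size_bytes
    )
}

/// Builds the Cachegrind argument list with explicit cache geometry.
fn cachegrind_args(i1: &CacheGeometry, d1: &CacheGeometry, ll: &CacheGeometry) -> Vec<String> {
    vec![
        "--tool=cachegrind".to_string(),
        "--cache-sim=yes".to_string(),
        cache_flag("I1", i1),
        cache_flag("D1", d1),
        cache_flag("LL", ll),
    ]
}

fn main() {
    // Placeholder numbers only; replace them with the target machine's values.
    let l1 = CacheGeometry { size_bytes: 32 * 1024, associativity: 8, line_size_bytes: 64 };
    let ll = CacheGeometry { size_bytes: 8 * 1024 * 1024, associativity: 16, line_size_bytes: 64 };
    for arg in cachegrind_args(&l1, &l1, &ll) {
        println!("{arg}");
    }
}
```

Keeping the geometry in one struct makes it easy to switch between the reference hardware and the host configuration without touching the command-building code.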
#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
	use std::os::unix::process::CommandExt;
	std::process::Command::new("valgrind")
It doesn't look like we get an error printed if valgrind is missing.
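One possible way to surface that error, sketched with the standard library only. `ensure_tool_available` is a hypothetical helper, and the message wording is an assumption.

```rust
use std::io::ErrorKind;
use std::process::{Command, Stdio};

/// Tries to start `<tool> --version` and turns a missing binary into a readable error.
fn ensure_tool_available(tool: &str) -> Result<(), String> {
    let status = Command::new(tool)
        .arg("--version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status();
    match status {
        // The process started; its exit code is not inspected in this sketch.
        Ok(_) => Ok(()),
        // The OS could not find the executable at all.
        Err(e) if e.kind() == ErrorKind::NotFound => {
            Err(format!("`{tool}` not found in PATH; install it to profile cache misses"))
        }
        Err(e) => Err(format!("failed to start `{tool}`: {e}")),
    }
}

fn main() {
    if let Err(msg) = ensure_tool_available("valgrind") {
        eprintln!("{msg}");
    }
}
```

Running such a check before launching the profiler reports the problem up front instead of failing silently.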
Good catch!
That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it appears only after the process has completed. So I print it to a report file, which is a good option imho.
I think you could use https://doc.rust-lang.org/std/process/struct.Command.html#method.output for this (which enables you to get stderr as well). But printing to a file is good as well IMO 👍🏻
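A small sketch of that suggestion: `Command::output` waits for the child to exit and returns its collected stdout and stderr. The `run_and_capture_stderr` helper is hypothetical, and `sh` stands in for valgrind so the snippet runs without it (Unix-like systems only).

```rust
use std::process::Command;

/// Runs a program to completion and returns everything it wrote to stderr.
fn run_and_capture_stderr(program: &str, args: &[&str]) -> std::io::Result<String> {
    // `output()` blocks until the child exits, then returns the captured streams.
    let output = Command::new(program).args(args).output()?;
    Ok(String::from_utf8_lossy(&output.stderr).into_owned())
}

fn main() -> std::io::Result<()> {
    // Stand-in command that writes one line to stderr.
    let report = run_and_capture_stderr("sh", &["-c", "echo 'report goes here' >&2"])?;
    print!("{report}");
    Ok(())
}
```

The captured text could then be printed under a header or written to the report file, whichever reads better.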
@sandreim I tuned the simulation config to an Intel Ice Lake CPU.
Why we need it
To give another level of insight into why Polkadot's subsystems may perform slower than expected. Cache misses occur when processing large amounts of data, such as during availability recovery.
Why Cachegrind
Cachegrind has many drawbacks: it is slow, and it relies on its own, very basic cache simulation. But unlike perf, which is a great tool, Cachegrind can run in a virtual machine. This means we can easily run it on remote installations and even use it in CI/CD to catch possible regressions.
Why Cachegrind and not Callgrind, another part of Valgrind? Empirically, profiling simply runs faster with Cachegrind.
First results
First results were obtained while testing the approach. Here is an example.
The CLI output shows that 1.4% of L1 data cache accesses missed, which is not so bad given that the last-level cache held that data most of the time, missing only 0.3%. The L1 instruction cache had a 0.00% miss rate. Looking at the output file with cg_annotate shows that most of the misses occur during reed-solomon, which is expected.
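For reference, a miss rate of this kind is the number of misses divided by the number of references. The sketch below only illustrates that arithmetic; the counts are made up to reproduce a 1.4% figure and are not the actual measurement.

```rust
/// Miss rate in percent: misses divided by total references.
fn miss_rate_percent(misses: u64, refs: u64) -> f64 {
    if refs == 0 {
        // No references means there is nothing to miss.
        return 0.0;
    }
    misses as f64 * 100.0 / refs as f64
}

fn main() {
    // Made-up counts for illustration only.
    let d1_misses = 14_000;
    let d_refs = 1_000_000;
    println!("D1 miss rate: {:.1}%", miss_rate_percent(d1_misses, d_refs));
}
```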