PGO applicability to Vector #15631
Comments
Hey @zamazan4ik, thanks for the extensive writeup! I know we've discussed this in the past, but it was probably internal, on Slack, as I didn't find any related issues. I'm also pretty sure we had looked at it; I don't quite remember why we didn't move forward (even after testing it, I think), but it's interesting to see your results here. cc @jszwedko @tobz @blt, as I'm guessing y'all were involved with that original discussion.
Getting a 10-15% performance boost for essentially a bit of extra CI time per release is certainly an incredibly good trade-off. I think the biggest thing would just be, as you've pointed out, doing all of the legwork to figure out what platforms we can/can't do PGO on, and creating the CI steps to do it for release/nightly builds. I'd also be curious to figure out what workload is the best for PGO profiles. As an example: are any of our current soak/regression test workloads better/worse than what you used when locally testing? That sort of thing.
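A quick check of that figure against the throughput reported in the issue description (300-310 k/s events before PGO, 350-370 k/s after), as a minimal Python sketch. The helper name `relative_gain` and the three pairings of the ranges are illustrative assumptions, not something taken from the thread:

```python
def relative_gain(before, after):
    """Relative throughput gain, e.g. 0.18 for +18%."""
    return after / before - 1.0


# Throughput ranges in thousands of events per second, from the issue text.
baseline = (300, 310)  # release build with the LTO flags from CI
with_pgo = (350, 370)  # same build plus PGO

# Most pessimistic pairing: best baseline run vs. worst PGO run.
low = relative_gain(max(baseline), min(with_pgo))
# Most optimistic pairing: worst baseline run vs. best PGO run.
high = relative_gain(min(baseline), max(with_pgo))
# Midpoint of each range as a single representative figure.
mid = relative_gain(sum(baseline) / 2, sum(with_pgo) / 2)

print(f"gain: {low:.1%} to {high:.1%}, midpoint {mid:.1%}")
# prints: gain: 12.9% to 23.3%, midpoint 18.0%
```

On these numbers, 10-15% sits near the conservative end of the measured range.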
Well, actually, PGO is in good shape on all major platforms (Linux, macOS, Windows). Probably the best source of truth regarding the state of PGO in the Rust ecosystem is
Yes, that will be the most time-consuming and boring part IMO. Also, do not forget that the build time at least doubles (instrumented build + run on the test workload + optimized build).
From my experience, the most beneficial parts should be CPU-heavy workloads (obviously). PGO shows good results on huge programs with a lot of different possible branches and a lot of context. In that case the compiler cannot make a good guess about hot/cold branching, real-life inlining, etc. That's where PGO shines. Long story short, I do not expect much of a performance gain in IO-bound workloads (e.g. posting to ElasticSearch), simply because the network is usually much slower than the CPU; even if we get a speedup there, we will not see it in real life.
I think that the ideal workload for a PGO profile should exercise all the components, or at least all the component subsystems, as there would be no benefit for those components that aren't exercised. It would probably be good to see some indication of code coverage with this too, something we are also lacking.
Good suggestion. I just want to add that this work could be done iteratively: add baseline loads for the components step by step. That way we can deliver PGO improvements incrementally instead of waiting until a baseline profile has been prepared for all components at once.
@jszwedko I got some examples of what a PGO-oriented page could look like:
I think a similar approach could be used for Vector as well - just create a page with a dedicated note about PGO and put it in the Vector documentation.
Thanks for the links @zamazan4ik! I've come around and agree that we could add this to the docs for advanced users that are able to compile Vector themselves and run example workloads. I could see it being a subpage under https://vector.dev/docs/administration/tuning/. Feel free to open a PR if you like 🙂
I ran some LTO, PGO and BOLT benchmarks on Linux and want to share my numbers. The test scenario is exactly the same as in #15631 (comment).
Setup
My setup is the following:
Results
Unfortunately, I didn't manage to test LTO + PGO on Linux, since it is broken on the current Rust version for reasons that are not yet known (see Kobzol/cargo-pgo#32 and llvm/llvm-project#57501 for further details). Hopefully, this will be fixed in the future. So I did some measurements on different LTO configurations with BOLT. The provided time is the time to complete the test scenario (process the same input file with
According to the results above, there are several conclusions:
@bruceg pinging you since you asked me about BOLT for Vector.
Thanks for this writeup, @zamazan4ik, that's great to see. Did
Nope, it doesn't work right now due to a compilation error in the "LTO + PGO" combination. I've created an issue in Kobzol/cargo-pgo#32 and added a comment to a possibly related LLVM bug in llvm/llvm-project#57501 (comment). I haven't yet created an issue about this behavior in
Upstream bug regarding LTO + PGO: rust-lang/rust#115344
TL;DR: With PGO Vector got a boost from 300-310 k/s events to 350-370 k/s events!
Hi!
I am a big fan of PGO, so I've tried to use PGO with Vector. And I wanna share with you my current results. My hypothesis is the following: even for programs with LTO, PGO can bring HUGE benefits. And I decided to test it. From my experience, PGO especially works well with large codebases with some CPU-hot parts. Looks like Vector fits really well.
Test scenario
This test scenario is completely real-life (except blackhole ofc :) ), and the log format and parse function are copied almost verbatim from our current prod env. We have patched the flog tool to generate our log format (closed-source patch, sorry; I could publish it later if there is a need for it).
Example of one log entry:
<E 2296456 point.server.session 18.12 19:17:36:361298178 processCall We need to generate the solid state GB interface! (from session.cpp +713)
So Vector config is the following (toml):
You could say: "Test scenario is too simple", but:
- ... `blackhole` with smth like `elasticsearch` sink)
Test setup
MacBook M1 Pro with macOS Ventura 13.1, 6+2 ARM CPU cores (AFAIK), 16 GiB RAM and a 512 GiB SSD. Sorry, I have no Linux machine near me right now, nor any desire to test it on a Linux VM or an Asahi Linux setup. However, I am completely sure that the results will be reproducible on the "usual" Linux-based x86-64 setup.
How to build
Vector already uses fat LTO for the release build on CI. However, a local release build differs from the CI one: it does not use fat LTO, since that is far too time-consuming. So do not forget to add the following flags to your release build (taken from `scripts/environment/release-flags.sh`):
For the PGO build of Vector I used this nice wrapper: https://github.com/Kobzol/cargo-pgo. You could do it manually if you want - I am just a little bit lazy :)
The guide is simple:
- Install `cargo pgo`.
- Run `cargo pgo build`. It will build the instrumented Vector version.
- Run `cargo pgo run -- -- -c /Users/zamazan4ik/open_source/test_vector_logs/vector.toml`.
- Press `ctrl+c` to interrupt Vector. The profile data will be generated somewhere in the `target` directory.
- Run `cargo pgo optimize`. It will start the build again with the generated profile data.
Is it worth it?
Yes! At least in my case, I got a huge boost: from 300-310 k/s events (according to `vector top`) with the default Vector release build with the LTO flags from CI to 350-370 k/s with the same build + PGO enabled.
The comparison strategy is simple: run the LTOed Vector binary, then the LTOed + PGOed Vector binary (resetting the file checkpoint in between, ofc), measure the total time until the whole file is processed, and track metrics via `vector top` during the execution.
Results are stable and reproducible. I have performed multiple runs in different execution orders with the same results.
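For anyone who wants to script the build guide above (for example as a starting point for the CI steps discussed in the comments), here is a minimal Python sketch. Everything named here is an assumption for illustration: the `Step`, `pgo_plan` and `run_plan` helpers, the `profile_seconds` duration, the `vector.toml` path, and the use of `cargo install cargo-pgo` for the install step (the guide only says to install it). It is POSIX-only, because it imitates ctrl+c by sending SIGINT to the process group.

```python
import os
import signal
import subprocess
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    argv: tuple              # command and its arguments
    interrupt: bool = False  # True for the long-running profiling run


def pgo_plan(config_path):
    """Return the PGO steps from the guide, in order."""
    return [
        # Assumed install command; per the cargo-pgo README the
        # llvm-tools-preview rustup component is needed as well.
        Step(("cargo", "install", "cargo-pgo")),
        Step(("cargo", "pgo", "build")),
        # The instrumented Vector keeps running until it is interrupted.
        Step(("cargo", "pgo", "run", "--", "--", "-c", config_path), interrupt=True),
        Step(("cargo", "pgo", "optimize")),
    ]


def run_plan(steps, profile_seconds=300):
    """Execute the steps; profile_seconds is an arbitrary, assumed duration."""
    for step in steps:
        if step.interrupt:
            # Own session, so SIGINT reaches cargo and Vector alike,
            # as ctrl+c in a terminal would.
            proc = subprocess.Popen(step.argv, start_new_session=True)
            time.sleep(profile_seconds)
            os.killpg(proc.pid, signal.SIGINT)
            proc.wait()
        else:
            subprocess.run(step.argv, check=True)


# Print the plan without executing anything.
for step in pgo_plan("vector.toml"):
    print(" ".join(step.argv))
```

Calling `run_plan(pgo_plan("vector.toml"))` would execute the steps for real; how long the profiling run needs depends on the workload, so the default duration is only a placeholder.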
So what?
So what could we do with it?
Possible future steps
Possible future steps for improving:
- ... `lld` or `mold` but I am not sure. AFAIK, `mold` has (or had, since this awesome linker evolves quickly) some caveats with LTO builds.
I hope the long read will be at least interesting for someone :) If you have any questions about it - just ask me here or on the official Vector Discord server (`zamazan4ik` nickname as well).