perf(ring_outlier_filter): a cache friendly impl #3293
Conversation
Using the setup described here #3269 (comment), I can confirm this implementation is "cache friendly". Compared to the initial implementation, the cache-miss share of
The biggest source of cache misses in the filter is now a data copy (accessed linearly):
One big hint that the code is more cache friendly is that there is no big source of cache misses anymore.
Codecov Report: patch coverage has no change; project coverage changes by -0.02%.

@@            Coverage Diff             @@
##             main    #3293      +/-   ##
==========================================
- Coverage   14.09%   14.08%    -0.02%
==========================================
  Files        1446     1445        -1
  Lines      102069   102144       +75
  Branches    27228    27228
==========================================
  Hits        14388    14388
- Misses      71885    71960       +75
  Partials    15796    15796

This pull request uses carry forward flags. View full report in Codecov by Sentry.
I have tested and benchmarked a few variations. In particular, I benchmarked my implementation against 82262b9.
The benchmark setup is described here. The CPU clock is locked at 3 GHz, and I am using the same VLP-32 rosbag data for all runs.
All variations have approximately the same performance; the proposed implementation is more stable, but not by a huge amount. The big difference is that the cache-miss count of 82262b9 is roughly twice as big as that of the proposed implementation. Using perf to monitor each process over 120 seconds, I measured:
So even though the individual performance is roughly the same, there is a big difference in terms of cache performance (which impacts the performance of other processes).
Rebased the code, then benchmarked again following the protocol described here #3269 (comment). The rosbag used for the benchmark this time is one of Autoware's demo rosbags, whose point cloud scans each contain ~218,000 points (from a VLP-128, I guess?). The rosbag is pretty short though (~30 seconds).
This rosbag is maybe a bit unfair to the current implementation, because the data is much heavier. But it also highlights the impact of cache misses. Cache misses on the current implementation:
Same as in my previous report, the
Now with the current PR, the number of cache misses in
Note that there are more cache misses than before my rebase. I don't know if it is related to the point cloud size or if it is because the code can't memcpy as much anymore. It is unclear where exactly the biggest source of cache misses is; I feel it is in one of the lambda helpers (e.g.
Note that these functions could be made faster if Autoware enforced the PointCloud2 data field format (see #2618 (comment)). The output data "looks OK", but I don't have any real test. I don't have many rosbags either, so any help would be appreciated.
Commit messages (squashed):
- style(pre-commit): autofix
- fix(ring_outlier_filter): cleanup code with ranges
- [breaking] fix autoware point type padding for faster memory access: can memcpy between input data buffer and PointXYZI*, make assumption on memory layout for faster data fetch, misc optimization (reordering, constexpr, etc)
- comment limitations
- feat(ring_outlier_filter): add accurate isCluster impl
- fix autowarefoundation#3218
- cleaning

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>
Are the performance improvements measured without the influence of the operating system? If not, could you please refer to this document?
@sykwer Very interesting reading. https://github.com/sykwer/ros2_single_node_replayer and https://github.com/sykwer/pmu_analyzer look very nice. I'll let you guess: I have not used these resources for my setup ^^. What I have done to reduce interference:

So, very far from compiling a custom kernel.

Regarding node isolation, I am not sure it makes sense in this case. In the original issue, I pointed out that cache performance was most likely an issue with the filter implementation. In a first attempt to quantify the problem, I used a rosbag to feed data directly to the filter. Doing so blows up the cache-miss count, because the input data is never in the cache. Even though cache access is made more efficient, it is difficult to see any improvement with this setup. On the other hand, when Autoware is running, it is likely the input data is hot in the cache when the filter tries to access it. This is why I decided to run the whole preprocessing pipeline for my benchmark (the last 2 components could have worked as well).

Not only that, but inefficient cache usage is most impactful when many nodes are running: when a cache line is invalidated, many processes are impacted, not just the one responsible for the cache miss. As I reported in the original issue, simply skipping the ring outlier filter on our vehicle reduced htop CPU usage by 20%... yet the filter itself was not using 20% of the CPU.

In any case, it seems that this PR is currently broken (from one of the last few rebases, after my last benchmark). The ring index is not read properly anymore (or the input data is invalid) and my code reports:

I need to investigate that.
@VRichardJP I've set it to draft since you've reported that it stopped working correctly after the last rebase.
@xmfcx Sorry I don't have much time to work on Autoware recently. |
@VRichardJP I would like to troubleshoot the garbage data issue, and verify and complete the pull request if that is okay with you! |
@mojomex Sorry, I really can't find much free time to work on Autoware recently. |
Description
This is a "cache friendly" reimplementation of the ring outlier filter. See #3269 for more details.
The code is based on @sykwer's WiP improvements (see #3014), which I rewrote to improve memory access (hopefully).
This implementation produces slightly different output than the original implementation:
- ring_outlier_filter skips last input point (#3217)
- ring_outlier_filter ignores that ring's "end" point and "start" point are neighbors (#3218)

Related links
#3269
#3014
Tests performed
Note: these tests are fairly old and refer to an older implementation of the ring outlier filter. See recent comments for comparisons between the current implementation and this PR.
Simple test on a few random rosbags. Plot data is the output of /ring_outlier_filter/debug/processing_time_ms. In blue, the current implementation (with hashmap); in orange, the proposed implementation (see the more recent comments for more accurate benchmarks).
On VLP-16 data (~20000 points per message):
![image](https://user-images.githubusercontent.com/18645627/230264541-0672edee-c302-4056-8e47-b55dc7416332.png)
On VLP-32 data (~50000 points per message):
![image](https://user-images.githubusercontent.com/18645627/230308580-84bd1286-9c30-4abc-a335-317898cc1f78.png)
In both cases, my implementation is roughly 3 times faster than the original one. Note that I am not using the TLSF allocator.
This is of course just a very simple benchmark and I will need to test more. In particular, I want to:
I don't necessarily expect this implementation to shine when the filter is the only process running on the machine. However, I expect it will perform way better when the machine is stressed.
Notes for reviewers
Pre-review checklist for the PR author
The PR author must check the checkboxes below when creating the PR.
In-review checklist for the PR reviewers
The PR reviewers must check the checkboxes below before approval.
Post-review checklist for the PR author
The PR author must check the checkboxes below before merging.
After all checkboxes are checked, anyone who has write access can merge the PR.