Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(ring_outlier_filter): a cache friendly impl #3293

Conversation

VRichardJP
Copy link
Contributor

@VRichardJP VRichardJP commented Apr 6, 2023

Description

This is a "cache friendly" reimplementation of ring outlier filter. See #3269 for more details.

The code is based on @sykwer WiP improvements (see #3014), which I rewrote to improve memory access (hopefully).

This implementation produce slightly different output than the original implementation:

Related links

#3269
#3014

Tests performed

Note: these tests are fairly old and refer to older implementation of ring outlier filters. See recent comments for comparisons between the current implementation and this PR.

Simple test on a few random rosbags. Plot data is the output of /ring_outlier_filter/debug/processing_time_ms. In blue the current implementation (with hashmap), in orange the proposed implementation (see the more recent comments for more accurate benchmarks):

  • On VLP-16 data (~20000 points per message):
    image

  • On VLP-32 data (~50000 points per message):
    image

In both cases, my implementation is roughly 3 times faster than the original one. Note that I am not using the TLSF allocator.

This is of course just a very simple benchmark and I will need to test more. In particular, I want to:

  • compare cache performance with old implementation
  • compare performance with @sykwer implementation (after his changes have been merged)
  • compare performance on vehicle, with all autoware modules running

I don't necessarily expect this implementation to shine when the filter is the only process running on the machine. However I expect it will perform way better when the machine is stressed.

Notes for reviewers

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

  • The PR follows the pull request guidelines.
  • The PR has been properly tested.
  • The PR has been reviewed by the code owners.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

  • There are no open discussions or they are tracked via tickets.
  • The PR is ready for merge.

After all checkboxes are checked, anyone who has write access can merge the PR.

@github-actions github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) labels Apr 6, 2023
@mitsudome-r mitsudome-r requested a review from sykwer April 6, 2023 06:29
@VRichardJP
Copy link
Contributor Author

VRichardJP commented Apr 7, 2023

Using the setup described here #3269 (comment), I can confirm this implementation is "cache friendly".

Compared to the initial implementation, cache-miss share of RingOutlierFilterCompoment dropped from 12.59% to 4.40% (count of cache miss also dropped a lot):

Samples: 8K of event 'cache-misses', Event count (approx.): 158292133
Overhead  Command          Shared Object                                              Symbol
  32.79%  component_conta  libc.so.6                                                  [.] __memset_avx2_unaligned_erms                                                                    ◆
   8.03%  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::CropBoxFilterComponent::faster_filter                                  ▒
   6.84%  component_conta  libc.so.6                                                  [.] __memmove_avx_unaligned_erms                                                                    ▒
   6.29%  component_conta  libc.so.6                                                  [.] pthread_mutex_lock@@GLIBC_2.2.5                                                                 ▒
   4.40%  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter                              ▒
   3.73%  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::PointCloudConcatenateDataSynchronizerComponent::convertToXYZICloud     ▒

The biggest source of cache miss in the filter is now a data copy (accessed linearly):

       │      output_size += sizeof(PointXYZI);                                                                                                                                           ▒
  0.52 │        add      $0x14,%r9                                                                                                                                                        ▒
       │      memcpy():                                                                                                                                                                   ▒
       │        movups   %xmm5,(%rdx)                                                                                                                                                     ▒
  4.35 │        mov      0x10(%rax),%ecx                                                                                                                                                  ▒
  3.20 │        mov      %ecx,0x10(%rdx)                                                                                                                                                  ▒
       │      pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter(std::shared_ptr<sensor_msgs::msg::PointCloud2_<std::allocator<void> > const> const&, std::shared_ptr<std:▒

One big hint the code is more cache friendly is that there is no big source of cache miss anymore.

@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from ca5d15b to e3936e0 Compare April 7, 2023 05:26
@github-actions github-actions bot added the component:common Common packages from the autoware-common repository. (auto-assigned) label Apr 7, 2023
@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 2701df3 to 22e1aaf Compare April 10, 2023 23:27
@codecov
Copy link

codecov bot commented Apr 11, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.02 ⚠️

Comparison is base (4eef087) 14.09% compared to head (8d8117f) 14.08%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3293      +/-   ##
==========================================
- Coverage   14.09%   14.08%   -0.02%     
==========================================
  Files        1446     1445       -1     
  Lines      102069   102144      +75     
  Branches    27228    27228              
==========================================
  Hits        14388    14388              
- Misses      71885    71960      +75     
  Partials    15796    15796              
Flag Coverage Δ *Carryforward flag
differential 0.00% <0.00%> (?)
total 14.10% <ø> (+<0.01%) ⬆️ Carriedforward from b7e54f6

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files Coverage Δ
...src/outlier_filter/ring_outlier_filter_nodelet.cpp 0.00% <0.00%> (ø)
.../pointcloud_preprocessor/src/utility/utilities.cpp 0.00% <0.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 22e1aaf to 0edd97e Compare April 12, 2023 00:30
@VRichardJP
Copy link
Contributor Author

VRichardJP commented Apr 13, 2023

I have tested and benchmarked a few variations. In particular I benchmarked my impl against 82262b9.

The benchmark setup is described here. CPU clock is locked 3GHz, and I am using the same VLP-32 rosbag data for all.

image

  original 82262b9 proposal w/ fast isCluster w/ fix 3218
average (ms) 2.708554 0.787365 0.758712 0.741636 0.737616
stdev 0.227331 0.06549 0.047685 0.044937 0.048045

All variations have approximately the same performance. The proposed implementation is more stable (but not by a huge amount) The big difference is that 82262b9 cache miss count is roughly twice as big the proposed implementation.

Using perf to monitor the process over 120 seconds each. I measured:

  • 82262b9: 393 caches misses in pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter
  • proposed implementation: only 107 cache misses

So even though the individual performance is roughly the same, there is a big difference in term of cache performance (which impacts the performance of other processes).

@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 85c7823 to 9a24f2c Compare May 22, 2023 08:30
@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 9a24f2c to 66ccb82 Compare June 1, 2023 12:52
@github-actions github-actions bot removed the component:common Common packages from the autoware-common repository. (auto-assigned) label Jun 1, 2023
@VRichardJP
Copy link
Contributor Author

VRichardJP commented Jun 1, 2023

Rebased the code, then benchmarked again following protocol described here #3269 (comment)

The rosbag used for benchmark this time is one of autoware's demo rosbag , whose pointcloud data scans contain each ~218000 points (from a VLP 128 I guess?). The rosbag is pretty short though (~30 seconds).

cache friendly PR current
average (ms) 4.26097894736842 6.2111043956044
stddev 0.756442711825567 1.58342325326683

image

This rosbag is maybe a bit unfair for the current implementation, because the data is really heavier. But it also highlights the impact of cache-misses.

Cache-misses on current implementation:

Samples: 28K of event 'cache-misses', Event count (approx.): 829442212                                                                                                                                                                       
Overhead       Samples  Command          Shared Object                                              Symbol                                                                                                                                   
  18.78%          3246  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter
  14.19%          2658  component_conta  libc.so.6                                                  [.] __memset_avx2_unaligned_erms
  11.17%          1595  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::CropBoxFilterComponent::faster_filter
   8.76%          1829  component_conta  libc.so.6                                                  [.] __memmove_avx_unaligned_erms
   4.46%          1418  component_conta  libc.so.6                                                  [.] pthread_mutex_lock@@GLIBC_2.2.5
   2.49%           374  component_conta  libvelodyne_rawdata.so                                     [.] velodyne_rawdata::RawData::unpack_vls128

Same as my previous report, the float azimuth_diff = next_azimuth - current_azimuth line is responsible for mosts cache-misses:

       │      walk_last_idx = idx;                                                                                                                                                                                                          ▒
  0.18 │        mov      %esi,%r11d                                                                                                                                                                                                         ▒
  0.67 │        mov      %esi,-0x8e8(%rbp)                                                                                                                                                                                                  ▒
       │      float azimuth_diff = next_azimuth - current_azimuth;                                                                                                                                                                          ▒
  0.09 │        lea      0x0(%r13,%rdi,1),%rdx                                                                                                                                                                                              ◆
  0.12 │        movss    (%rdx,%r12,1),%xmm5                                                                                                                                                                                                ▒
 36.99 │        subss    0x0(%r13,%rcx,1),%xmm5                                                                                                                                                                                             ▒
       │      azimuth_diff = azimuth_diff < 0.f ? azimuth_diff + 36000.f : azimuth_diff;                                                                                                                                                    ▒
  2.93 │        comiss   %xmm5,%xmm2                                                                                                                                                                                                        ▒
  0.62 │      ↓ jbe      7e6                                                                                                                                                                                                                ▒
       │        addss    %xmm4,%xmm5                                                   

Now with the current PR, the number of cache-miss in RingOutlierFilterComponent::faster_filter drops a lot.:

Samples: 25K of event 'cache-misses', Event count (approx.): 794706274                                                                                                                                                                       
Overhead       Samples  Command          Shared Object                                              Symbol                                                                                                                                   
  16.73%          2925  component_conta  libc.so.6                                                  [.] __memset_avx2_unaligned_erms
  12.55%          1703  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::CropBoxFilterComponent::faster_filter
  10.78%          1979  component_conta  libc.so.6                                                  [.] __memmove_avx_unaligned_erms
  10.13%          1888  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter
   4.32%          1233  component_conta  libc.so.6                                                  [.] pthread_mutex_lock@@GLIBC_2.2.5
   2.84%           379  component_conta  libvelodyne_rawdata.so                                     [.] velodyne_rawdata::RawData::unpack_vls128

Note that there are more cache-misses than before my rebate. I don't know if it's related to the point cloud size or if it is because the code can't memcpy that much anymore.

It is unclear where exactly is the biggest source of cache miss, I feel it is in one of the lambda helpers (e.g. getPointAzimuth()):

       │      pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter(std::shared_ptr<sensor_msgs::msg::PointCloud2_<std::allocator<void> > const> const&, std::shared_ptr<std::vector<int, std::allocator<int> > > const&, senso▒
       │      ↓ je       1320                                                                                                                                                                                                               ▒
  0.41 │        mov      -0x9d8(%rbp),%rax                                                                                                                                                                                                  ▒
  0.73 │        mov      0x8(%rdx),%ecx                                                                                                                                                                                                     ▒
 10.77 │        mov      (%rdx),%rbx                                                                                                                                                                                                        ▒
  0.34 │        mov      (%rdx,%rax,1),%eax                                                                                                                                                                                                 ▒
  6.12 │        shl      $0x20,%rax                                                                                                                                                                                                         ▒
  1.03 │        or       %rcx,%rax                                                                                                                                                                                                          ▒
  1.97 │        mov      -0x9c0(%rbp),%rcx                                                                                                                                                                                                  ▒

Note that these functions could be made faster if Autoware enforced PointCloud2 data field format (see #2618 (comment)).

The output data "looks ok", but I don't have any real test. I don't have many rosbag either, so any help would be appreciated.

@VRichardJP VRichardJP marked this pull request as ready for review June 1, 2023 12:53
@VRichardJP VRichardJP requested review from amc-nu, miursh, yukkysaito and a team as code owners June 1, 2023 12:53
@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from cb23f0e to 346bbbd Compare June 1, 2023 13:10
@VRichardJP VRichardJP changed the title perf(ring_outlier_filter): an attempt of cache friendly impl (WIP) perf(ring_outlier_filter): a cache friendly impl Jun 1, 2023
VRichardJP and others added 4 commits June 5, 2023 09:09
Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

style(pre-commit): autofix

fix(ring_outlier_filter): cleanup code with ranges

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

[breaking] fix autoware point type padding for faster memory access

can memcpy between input data buffer and PointXYZI*
make assumption on memory layout for faster data fetch
misc optimization (reordering, constexpr, etc)

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

style(pre-commit): autofix

comment limitations

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

feat(ring_outlier_filter): add accurate isCluster impl

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

style(pre-commit): autofix

fix autowarefoundation#3218

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

cleaning

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

style(pre-commit): autofix
Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>
Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>
@VRichardJP VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 346bbbd to b7e54f6 Compare June 5, 2023 00:09
Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>
@xmfcx xmfcx self-requested a review June 5, 2023 12:17
@sykwer
Copy link
Member

sykwer commented Jun 6, 2023

Are performance improvements measured without the influence of the operating system? It not so, could you please refer to this document?
https://autowarefoundation.github.io/autoware-documentation/main/how-to-guides/integrating-autoware/tuning-parameters-and-performance/evaluating-real-time-performance/#performance-measurement

@VRichardJP
Copy link
Contributor Author

@sykwer Very interesting reading. https://github.com/sykwer/ros2_single_node_replayer and https://github.com/sykwer/pmu_analyzer look very nice.

I let you guess have not used these ressources for my setup ^^.

What I have done to reduce interferences:

  • only launch the pointcloud preprocessing pipeline (from the sample sensor kit)
  • only run rosbag and perf
  • used fixed CPU clock speed

So very far from compiling a custom kernel.

Regarding node isolation, I am not sure it makes sense in this case. In the original issue, I pointed out that cache performance was most likely an issue with the filter implementation. In a first attempt to quantify the problem, I used a rosbag to feed data directly to the filter. Doing so blows up the cache-misses count, because the input data is never in the cache. Even though cache access is made more efficient, it is difficult to see any improvement with this setup.

On the other hand, when Autoware is running, it is likely the input data is hot in the cache when the filter tries to access it. This is why I decided to run the whole preprocessing pipeline for my benchmark (the last 2 components could have worked as well). Not only that, but inefficient cache usage is most impactful when many nodes are running: when a cache line is invalidated, many processes are impacted, not just the one responsible for the cache miss. As I reported in the original issue, simply skipping the ring outlier filter on our vehicle reduced htop cpu usage by 20%... yet the filter was not using 20% of the cpu itself.

In any case, it seems that this PR is currently broken (from the last few rebase, but after my last benchmark). The ring index is not read properly anymore (or input data is invalid) and my code reports:

[component_container_mt-1] [WARN 1686136475.418574920] [sensing.lidar.top.ring_outlier_filter]: 59649 points had ring index over max_rings_num (128) and have been ignored. (faster_filter() at /home/sig/autoware/src/universe/autoware.universe/sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp:350)

I need to investigate that.

@xmfcx xmfcx marked this pull request as draft June 13, 2023 15:21
@xmfcx
Copy link
Contributor

xmfcx commented Jun 13, 2023

@VRichardJP I've set it to draft since you've reported that it stopped working correctly after the last rebase.

@VRichardJP
Copy link
Contributor Author

@xmfcx Sorry I don't have much time to work on Autoware recently.
During my recent tests however, I have observed the ring field of ring outlier filter input cloud is filled with garbage data. Maybe I got unlucky in my rebase, but I don't think this was related to this PR.

@mojomex
Copy link
Contributor

mojomex commented Jul 5, 2023

@VRichardJP I would like to troubleshoot the garbage data issue, and verify and complete the pull request if that is okay with you!

@VRichardJP
Copy link
Contributor Author

@mojomex Sorry, I really can't find much free time to work on Autoware recently.
I have pretty much pushed everything and documented all the details, so feel free to take over this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) type:documentation Creating or refining documentation. (auto-assigned)
Projects
None yet
4 participants