perf(ring_outlier_filter): a cache friendly impl #3293

VRichardJP · 2023-04-06T03:39:32Z

Description

This is a "cache friendly" reimplementation of ring outlier filter. See #3269 for more details.

The code is based on @sykwer WiP improvements (see #3014), which I rewrote to improve memory access (hopefully).

This implementation produce slightly different output than the original implementation:

output points are in same order than input. In the original implementation, points are reordered and copied by ring.
Fixes ring_outlier_filter skips last input point #3217
Fixes ring_outlier_filter ignores ring's "end" point and "start" point are neighbors #3218

Tests performed

Note: these tests are fairly old and refer to older implementation of ring outlier filters. See recent comments for comparisons between the current implementation and this PR.

Simple test on a few random rosbags. Plot data is the output of /ring_outlier_filter/debug/processing_time_ms. In blue the current implementation (with hashmap), in orange the proposed implementation (see the more recent comments for more accurate benchmarks):

On VLP-16 data (~20000 points per message):
On VLP-32 data (~50000 points per message):

In both cases, my implementation is roughly 3 times faster than the original one. Note that I am not using the TLSF allocator.

This is of course just a very simple benchmark and I will need to test more. In particular, I want to:

compare cache performance with old implementation
compare performance with @sykwer implementation (after his changes have been merged)
compare performance on vehicle, with all autoware modules running

I don't necessarily expect this implementation to shine when the filter is the only process running on the machine. However I expect it will perform way better when the machine is stressed.

Notes for reviewers

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

I've confirmed the contribution guidelines.
The PR follows the pull request guidelines.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

The PR follows the pull request guidelines.
The PR has been properly tested.
The PR has been reviewed by the code owners.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

There are no open discussions or they are tracked via tickets.
The PR is ready for merge.

After all checkboxes are checked, anyone who has write access can merge the PR.

VRichardJP · 2023-04-07T03:48:54Z

Using the setup described here #3269 (comment), I can confirm this implementation is "cache friendly".

Compared to the initial implementation, cache-miss share of RingOutlierFilterCompoment dropped from 12.59% to 4.40% (count of cache miss also dropped a lot):

Samples: 8K of event 'cache-misses', Event count (approx.): 158292133
Overhead  Command          Shared Object                                              Symbol
  32.79%  component_conta  libc.so.6                                                  [.] __memset_avx2_unaligned_erms                                                                    ◆
   8.03%  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::CropBoxFilterComponent::faster_filter                                  ▒
   6.84%  component_conta  libc.so.6                                                  [.] __memmove_avx_unaligned_erms                                                                    ▒
   6.29%  component_conta  libc.so.6                                                  [.] pthread_mutex_lock@@GLIBC_2.2.5                                                                 ▒
   4.40%  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter                              ▒
   3.73%  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::PointCloudConcatenateDataSynchronizerComponent::convertToXYZICloud     ▒

The biggest source of cache miss in the filter is now a data copy (accessed linearly):

       │      output_size += sizeof(PointXYZI);                                                                                                                                           ▒
  0.52 │        add      $0x14,%r9                                                                                                                                                        ▒
       │      memcpy():                                                                                                                                                                   ▒
       │        movups   %xmm5,(%rdx)                                                                                                                                                     ▒
  4.35 │        mov      0x10(%rax),%ecx                                                                                                                                                  ▒
  3.20 │        mov      %ecx,0x10(%rdx)                                                                                                                                                  ▒
       │      pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter(std::shared_ptr<sensor_msgs::msg::PointCloud2_<std::allocator<void> > const> const&, std::shared_ptr<std:▒

One big hint the code is more cache friendly is that there is no big source of cache miss anymore.

codecov · 2023-04-11T00:00:46Z

Codecov Report

Patch coverage has no change and project coverage change: -0.02 ⚠️

Comparison is base (4eef087) 14.09% compared to head (8d8117f) 14.08%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3293      +/-   ##
==========================================
- Coverage   14.09%   14.08%   -0.02%     
==========================================
  Files        1446     1445       -1     
  Lines      102069   102144      +75     
  Branches    27228    27228              
==========================================
  Hits        14388    14388              
- Misses      71885    71960      +75     
  Partials    15796    15796

Flag	Coverage Δ		*Carryforward flag
differential	`0.00% <0.00%> (?)`
total	`14.10% <ø> (+<0.01%)`	⬆️	Carriedforward from b7e54f6

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files	Coverage Δ
...src/outlier_filter/ring_outlier_filter_nodelet.cpp	`0.00% <0.00%> (ø)`
.../pointcloud_preprocessor/src/utility/utilities.cpp	`0.00% <0.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

VRichardJP · 2023-04-13T06:53:16Z

I have tested and benchmarked a few variations. In particular I benchmarked my impl against 82262b9.

I tried a variation using a "fast isCluster" implementation. However the performance is mostly the same than the accurate version. I guess most processing time is spent on moving data rather than some math.
I tested a version with a fix for ring_outlier_filter ignores ring's "end" point and "start" point are neighbors #3218. The fix has no significant performance impact so I am using it by default now.

The benchmark setup is described here. CPU clock is locked 3GHz, and I am using the same VLP-32 rosbag data for all.

	original	82262b9	proposal	w/ fast isCluster	w/ fix 3218
average (ms)	2.708554	0.787365	0.758712	0.741636	0.737616
stdev	0.227331	0.06549	0.047685	0.044937	0.048045

All variations have approximately the same performance. The proposed implementation is more stable (but not by a huge amount) The big difference is that 82262b9 cache miss count is roughly twice as big the proposed implementation.

Using perf to monitor the process over 120 seconds each. I measured:

82262b9: 393 caches misses in pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter
proposed implementation: only 107 cache misses

So even though the individual performance is roughly the same, there is a big difference in term of cache performance (which impacts the performance of other processes).

VRichardJP · 2023-06-01T12:53:13Z

Rebased the code, then benchmarked again following protocol described here #3269 (comment)

The rosbag used for benchmark this time is one of autoware's demo rosbag , whose pointcloud data scans contain each ~218000 points (from a VLP 128 I guess?). The rosbag is pretty short though (~30 seconds).

	cache friendly PR	current
average (ms)	4.26097894736842	6.2111043956044
stddev	0.756442711825567	1.58342325326683

This rosbag is maybe a bit unfair for the current implementation, because the data is really heavier. But it also highlights the impact of cache-misses.

Cache-misses on current implementation:

Samples: 28K of event 'cache-misses', Event count (approx.): 829442212                                                                                                                                                                       
Overhead       Samples  Command          Shared Object                                              Symbol                                                                                                                                   
  18.78%          3246  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter
  14.19%          2658  component_conta  libc.so.6                                                  [.] __memset_avx2_unaligned_erms
  11.17%          1595  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::CropBoxFilterComponent::faster_filter
   8.76%          1829  component_conta  libc.so.6                                                  [.] __memmove_avx_unaligned_erms
   4.46%          1418  component_conta  libc.so.6                                                  [.] pthread_mutex_lock@@GLIBC_2.2.5
   2.49%           374  component_conta  libvelodyne_rawdata.so                                     [.] velodyne_rawdata::RawData::unpack_vls128

Same as my previous report, the float azimuth_diff = next_azimuth - current_azimuth line is responsible for mosts cache-misses:

       │      walk_last_idx = idx;                                                                                                                                                                                                          ▒
  0.18 │        mov      %esi,%r11d                                                                                                                                                                                                         ▒
  0.67 │        mov      %esi,-0x8e8(%rbp)                                                                                                                                                                                                  ▒
       │      float azimuth_diff = next_azimuth - current_azimuth;                                                                                                                                                                          ▒
  0.09 │        lea      0x0(%r13,%rdi,1),%rdx                                                                                                                                                                                              ◆
  0.12 │        movss    (%rdx,%r12,1),%xmm5                                                                                                                                                                                                ▒
 36.99 │        subss    0x0(%r13,%rcx,1),%xmm5                                                                                                                                                                                             ▒
       │      azimuth_diff = azimuth_diff < 0.f ? azimuth_diff + 36000.f : azimuth_diff;                                                                                                                                                    ▒
  2.93 │        comiss   %xmm5,%xmm2                                                                                                                                                                                                        ▒
  0.62 │      ↓ jbe      7e6                                                                                                                                                                                                                ▒
       │        addss    %xmm4,%xmm5

Now with the current PR, the number of cache-miss in RingOutlierFilterComponent::faster_filter drops a lot.:

Samples: 25K of event 'cache-misses', Event count (approx.): 794706274                                                                                                                                                                       
Overhead       Samples  Command          Shared Object                                              Symbol                                                                                                                                   
  16.73%          2925  component_conta  libc.so.6                                                  [.] __memset_avx2_unaligned_erms
  12.55%          1703  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::CropBoxFilterComponent::faster_filter
  10.78%          1979  component_conta  libc.so.6                                                  [.] __memmove_avx_unaligned_erms
  10.13%          1888  component_conta  libpointcloud_preprocessor_filter.so                       [.] pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter
   4.32%          1233  component_conta  libc.so.6                                                  [.] pthread_mutex_lock@@GLIBC_2.2.5
   2.84%           379  component_conta  libvelodyne_rawdata.so                                     [.] velodyne_rawdata::RawData::unpack_vls128

Note that there are more cache-misses than before my rebate. I don't know if it's related to the point cloud size or if it is because the code can't memcpy that much anymore.

It is unclear where exactly is the biggest source of cache miss, I feel it is in one of the lambda helpers (e.g. getPointAzimuth()):

       │      pointcloud_preprocessor::RingOutlierFilterComponent::faster_filter(std::shared_ptr<sensor_msgs::msg::PointCloud2_<std::allocator<void> > const> const&, std::shared_ptr<std::vector<int, std::allocator<int> > > const&, senso▒
       │      ↓ je       1320                                                                                                                                                                                                               ▒
  0.41 │        mov      -0x9d8(%rbp),%rax                                                                                                                                                                                                  ▒
  0.73 │        mov      0x8(%rdx),%ecx                                                                                                                                                                                                     ▒
 10.77 │        mov      (%rdx),%rbx                                                                                                                                                                                                        ▒
  0.34 │        mov      (%rdx,%rax,1),%eax                                                                                                                                                                                                 ▒
  6.12 │        shl      $0x20,%rax                                                                                                                                                                                                         ▒
  1.03 │        or       %rcx,%rax                                                                                                                                                                                                          ▒
  1.97 │        mov      -0x9c0(%rbp),%rcx                                                                                                                                                                                                  ▒

Note that these functions could be made faster if Autoware enforced PointCloud2 data field format (see #2618 (comment)).

The output data "looks ok", but I don't have any real test. I don't have many rosbag either, so any help would be appreciated.

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> style(pre-commit): autofix fix(ring_outlier_filter): cleanup code with ranges Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> [breaking] fix autoware point type padding for faster memory access can memcpy between input data buffer and PointXYZI* make assumption on memory layout for faster data fetch misc optimization (reordering, constexpr, etc) Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> style(pre-commit): autofix comment limitations Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> feat(ring_outlier_filter): add accurate isCluster impl Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> style(pre-commit): autofix fix autowarefoundation#3218 Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> cleaning Signed-off-by: Vincent Richard <richard-v@macnica.co.jp> style(pre-commit): autofix

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

sykwer · 2023-06-06T04:49:49Z

Are performance improvements measured without the influence of the operating system? It not so, could you please refer to this document?
https://autowarefoundation.github.io/autoware-documentation/main/how-to-guides/integrating-autoware/tuning-parameters-and-performance/evaluating-real-time-performance/#performance-measurement

VRichardJP · 2023-06-07T11:55:07Z

@sykwer Very interesting reading. https://github.com/sykwer/ros2_single_node_replayer and https://github.com/sykwer/pmu_analyzer look very nice.

I let you guess have not used these ressources for my setup ^^.

What I have done to reduce interferences:

only launch the pointcloud preprocessing pipeline (from the sample sensor kit)
only run rosbag and perf
used fixed CPU clock speed

So very far from compiling a custom kernel.

Regarding node isolation, I am not sure it makes sense in this case. In the original issue, I pointed out that cache performance was most likely an issue with the filter implementation. In a first attempt to quantify the problem, I used a rosbag to feed data directly to the filter. Doing so blows up the cache-misses count, because the input data is never in the cache. Even though cache access is made more efficient, it is difficult to see any improvement with this setup.

On the other hand, when Autoware is running, it is likely the input data is hot in the cache when the filter tries to access it. This is why I decided to run the whole preprocessing pipeline for my benchmark (the last 2 components could have worked as well). Not only that, but inefficient cache usage is most impactful when many nodes are running: when a cache line is invalidated, many processes are impacted, not just the one responsible for the cache miss. As I reported in the original issue, simply skipping the ring outlier filter on our vehicle reduced htop cpu usage by 20%... yet the filter was not using 20% of the cpu itself.

In any case, it seems that this PR is currently broken (from the last few rebase, but after my last benchmark). The ring index is not read properly anymore (or input data is invalid) and my code reports:

[component_container_mt-1] [WARN 1686136475.418574920] [sensing.lidar.top.ring_outlier_filter]: 59649 points had ring index over max_rings_num (128) and have been ignored. (faster_filter() at /home/sig/autoware/src/universe/autoware.universe/sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp:350)

I need to investigate that.

xmfcx · 2023-06-13T15:21:37Z

@VRichardJP I've set it to draft since you've reported that it stopped working correctly after the last rebase.

VRichardJP · 2023-06-16T03:05:11Z

@xmfcx Sorry I don't have much time to work on Autoware recently.
During my recent tests however, I have observed the ring field of ring outlier filter input cloud is filled with garbage data. Maybe I got unlucky in my rebase, but I don't think this was related to this PR.

mojomex · 2023-07-05T00:16:35Z

@VRichardJP I would like to troubleshoot the garbage data issue, and verify and complete the pull request if that is okay with you!

VRichardJP · 2023-07-05T04:14:52Z

@mojomex Sorry, I really can't find much free time to work on Autoware recently.
I have pretty much pushed everything and documented all the details, so feel free to take over this PR.

github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) labels Apr 6, 2023

mitsudome-r requested a review from sykwer April 6, 2023 06:29

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from ca5d15b to e3936e0 Compare April 7, 2023 05:26

github-actions bot added the component:common Common packages from the autoware-common repository. (auto-assigned) label Apr 7, 2023

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 2701df3 to 22e1aaf Compare April 10, 2023 23:27

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 22e1aaf to 0edd97e Compare April 12, 2023 00:30

VRichardJP mentioned this pull request Apr 19, 2023

On abusive use of reinterpret_cast UB #3215

Closed

3 tasks

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 85c7823 to 9a24f2c Compare May 22, 2023 08:30

VRichardJP mentioned this pull request Jun 1, 2023

ring_outlier_filter blows up my cache (and yours too) #3269

Closed

3 tasks

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 9a24f2c to 66ccb82 Compare June 1, 2023 12:52

github-actions bot removed the component:common Common packages from the autoware-common repository. (auto-assigned) label Jun 1, 2023

VRichardJP marked this pull request as ready for review June 1, 2023 12:53

VRichardJP requested review from amc-nu, miursh, yukkysaito and a team as code owners June 1, 2023 12:53

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from cb23f0e to 346bbbd Compare June 1, 2023 13:10

VRichardJP changed the title ~~perf(ring_outlier_filter): an attempt of cache friendly impl (WIP)~~ perf(ring_outlier_filter): a cache friendly impl Jun 1, 2023

VRichardJP and others added 4 commits June 5, 2023 09:09

style(pre-commit): autofix

4f743a5

resize vector to data size

6858b4e

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

cleaning

b7e54f6

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

VRichardJP force-pushed the cache_friendly_ring_outlier_filter branch from 346bbbd to b7e54f6 Compare June 5, 2023 00:09

cleaner utilities impl

8d8117f

Signed-off-by: Vincent Richard <richard-v@macnica.co.jp>

VRichardJP mentioned this pull request Jun 5, 2023

distortion_corrector should update "azimuth" and "distance" field #3896

Closed

3 tasks

xmfcx self-requested a review June 5, 2023 12:17

xmfcx marked this pull request as draft June 13, 2023 15:21

mojomex mentioned this pull request Jul 6, 2023

perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) #4185

Merged

7 tasks

VRichardJP closed this Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(ring_outlier_filter): a cache friendly impl #3293

perf(ring_outlier_filter): a cache friendly impl #3293

VRichardJP commented Apr 6, 2023 •

edited

Loading

VRichardJP commented Apr 7, 2023 •

edited

Loading

codecov bot commented Apr 11, 2023 •

edited

Loading

VRichardJP commented Apr 13, 2023 •

edited

Loading

VRichardJP commented Jun 1, 2023 •

edited

Loading

sykwer commented Jun 6, 2023

VRichardJP commented Jun 7, 2023

xmfcx commented Jun 13, 2023

VRichardJP commented Jun 16, 2023

mojomex commented Jul 5, 2023

VRichardJP commented Jul 5, 2023

perf(ring_outlier_filter): a cache friendly impl #3293

perf(ring_outlier_filter): a cache friendly impl #3293

Conversation

VRichardJP commented Apr 6, 2023 • edited Loading

Description

Related links

Tests performed

Notes for reviewers

Pre-review checklist for the PR author

In-review checklist for the PR reviewers

Post-review checklist for the PR author

VRichardJP commented Apr 7, 2023 • edited Loading

codecov bot commented Apr 11, 2023 • edited Loading

Codecov Report

VRichardJP commented Apr 13, 2023 • edited Loading

VRichardJP commented Jun 1, 2023 • edited Loading

sykwer commented Jun 6, 2023

VRichardJP commented Jun 7, 2023

xmfcx commented Jun 13, 2023

VRichardJP commented Jun 16, 2023

mojomex commented Jul 5, 2023

VRichardJP commented Jul 5, 2023

VRichardJP commented Apr 6, 2023 •

edited

Loading

VRichardJP commented Apr 7, 2023 •

edited

Loading

codecov bot commented Apr 11, 2023 •

edited

Loading

VRichardJP commented Apr 13, 2023 •

edited

Loading

VRichardJP commented Jun 1, 2023 •

edited

Loading