CPU nbody improvements and more benchmarks #133

bernhardmgruber · 2020-12-10T11:47:35Z

After a lot of work on the CUDA nbody, I ported some of its optimizations to the CPU version as well. This mainly concerns using accumulator variables.
Also, more versions are now run and plotted.
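
For readers unfamiliar with the term, here is a minimal sketch of what "using accumulator variables" can look like in an n-body velocity update. It is not the code from this PR: the `Vec3` type, the `updateVelocities` function, and the constants `kTimestep` and `kEps2` are assumptions made up for illustration. The idea is to sum the contributions of the inner loop in a local variable and write the result back once, instead of updating the particle's velocity in memory on every inner iteration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical constants, chosen only for this illustration.
constexpr float kTimestep = 0.0001f;
constexpr float kEps2 = 0.01f; // softening term, keeps the i == j case finite

// Adds the gravitational kick from all particles to each particle's velocity.
void updateVelocities(const std::vector<Vec3>& pos, const std::vector<float>& mass,
                      std::vector<Vec3>& vel) {
    assert(pos.size() == mass.size() && pos.size() == vel.size());
    const std::size_t n = pos.size();
    for (std::size_t i = 0; i < n; i++) {
        const Vec3 pi = pos[i];     // read particle i once
        Vec3 acc{0.0f, 0.0f, 0.0f}; // local accumulator, can stay in registers
        for (std::size_t j = 0; j < n; j++) {
            const float dx = pos[j].x - pi.x;
            const float dy = pos[j].y - pi.y;
            const float dz = pos[j].z - pi.z;
            const float distSqr = kEps2 + dx * dx + dy * dy + dz * dz;
            const float invDist = 1.0f / std::sqrt(distSqr);
            const float s = mass[j] * invDist * invDist * invDist * kTimestep;
            acc.x += dx * s;
            acc.y += dy * s;
            acc.z += dz * s;
        }
        // One write-back per particle instead of one store per inner iteration.
        vel[i].x += acc.x;
        vel[i].y += acc.y;
        vel[i].z += acc.z;
    }
}
```

Without the accumulator, the compiler may be unable to prove that `vel[i]` does not alias the data read in the inner loop, so it may have to store to memory in every iteration.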


bernhardmgruber · 2020-12-10T11:57:08Z

Here is a benchmark on an AMD Ryzen 9 5950X:

And on an Intel(R) Core(TM) i7-7820X:

psychocoderHPC · 2020-12-10T18:19:16Z

> Here is a benchmark on an AMD Ryzen 9 5950X:
> ...

Thanks for providing the plots together with the PR. The current results suggest that we need to use Vc, and I would like to have it integrated into PIConGPU ❤️

psychocoderHPC · 2020-12-10T18:21:16Z

What does *parallel mean?
Are all benchmarks running on a single core except those with parallel?
If so, how do you parallelize?

bernhardmgruber · 2020-12-11T10:03:28Z

> > Here is a benchmark on an AMD Ryzen 9 5950X:
> > ...
>
> Thanks for providing the plots together with the PR. The current results suggest that we need to use Vc, and I would like to have it integrated into PIConGPU ❤️

Indeed, I found that the best versions use either Vc or vector intrinsics; both generate close to optimal assembly. The caveat with Vc is that it is barely maintained at the moment, because all effort is focused on std::simd. We will see how that goes. But it is definitely worth using a SIMD library!

> What does *parallel mean?
> Are all benchmarks running on a single core except those with parallel?
> If so, how do you parallelize?

All benchmarks are indeed single-threaded, because I focused on pure single-thread performance. The last two benchmarks use the parallel STL, but only for comparison and as proof that good vector assembly also works well when multithreaded. However, libstdc++'s parallel STL implementation behaves oddly; it worked far better on Windows ;) That's why there is also an OpenMP version (commented out), which works better on Linux. I also suspect that the parallel STL does not parallelize if the workload is smallish (see the last benchmark on the AMD system). In the end, I should have a separate benchmark comparing the single-threaded and parallel Vc versions with a bigger workload.
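
To make the OpenMP variant mentioned above more concrete, here is a rough sketch of how the outer particle loop could be parallelized. This is an assumption about the general approach, not the code in this PR: `Body`, `Velocity`, `updateParallel`, and the softening value are hypothetical names and values. Each iteration `i` only writes `vel[i]`, so the iterations are independent and need no synchronization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    float x, y, z, mass;
};

struct Velocity {
    float x, y, z;
};

// Hypothetical helper: updates all velocities, distributing particles over threads.
void updateParallel(const std::vector<Body>& bodies, std::vector<Velocity>& vel,
                    float timestep, int threads) {
    assert(bodies.size() == vel.size() && threads >= 1);
    (void)threads; // unused when the code is built without OpenMP
    constexpr float eps2 = 0.01f; // softening term (illustrative value)
    // A signed loop index keeps the loop valid for older OpenMP implementations.
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(bodies.size());
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
#pragma omp parallel for schedule(static) num_threads(threads)
    for (std::ptrdiff_t i = 0; i < n; i++) {
        const Body bi = bodies[i];
        float ax = 0.0f, ay = 0.0f, az = 0.0f; // per-thread accumulators
        for (std::ptrdiff_t j = 0; j < n; j++) {
            const float dx = bodies[j].x - bi.x;
            const float dy = bodies[j].y - bi.y;
            const float dz = bodies[j].z - bi.z;
            const float invDist = 1.0f / std::sqrt(eps2 + dx * dx + dy * dy + dz * dz);
            const float s = bodies[j].mass * invDist * invDist * invDist * timestep;
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        vel[i].x += ax; // only thread-local element i is written
        vel[i].y += ay;
        vel[i].z += az;
    }
}
```

With GCC or Clang, OpenMP is enabled with `-fopenmp`; the vectorized inner loop is untouched, which is why a fast single-threaded kernel should carry over to the multithreaded case.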

bernhardmgruber · 2020-12-11T14:22:15Z

On the 16-core AMD, the fastest version with different numbers of threads:

I ran with GOMP_CPU_AFFINITY='0-30:2,1-31:2'. There is strong scaling in the beginning. I have the feeling the memory subsystem saturates at some point, and I have no clue what happens at 32 threads (hyperthreading). But I will stop investigating here; I want to focus mostly on single-thread performance for now.
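
As a side note on terminology, strong scaling compares runtimes for a fixed problem size as the thread count grows. A small sketch of the usual metrics follows; the `Scaling` struct and `strongScaling` function are hypothetical helpers, not part of the benchmark code. Speedup is the single-thread time divided by the N-thread time, and parallel efficiency is the speedup divided by N.

```cpp
#include <cassert>

struct Scaling {
    double speedup;    // timeOneThread / timeNThreads
    double efficiency; // speedup / threads, 1.0 means ideal scaling
};

// Computes strong-scaling metrics from two measured runtimes of the same workload.
Scaling strongScaling(double timeOneThread, double timeNThreads, int threads) {
    assert(timeOneThread > 0.0 && timeNThreads > 0.0 && threads >= 1);
    const double speedup = timeOneThread / timeNThreads;
    return Scaling{speedup, speedup / threads};
}
```

An efficiency that falls well below 1.0 as threads are added is the kind of signal that would be consistent with a saturating memory subsystem.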

psychocoderHPC · 2020-12-11T14:53:39Z

> I ran with GOMP_CPU_AFFINITY='0-30:2,1-31:2'. There is strong scaling in the beginning. I have the feeling the memory subsystem saturates at some point, and I have no clue what happens at 32 threads (hyperthreading). But I will stop investigating here; I want to focus mostly on single-thread performance for now.

If you are on Linux, you can check with hwloc-ls how the OS distributes the threads across the hardware resources.

bernhardmgruber · 2020-12-11T15:31:11Z

Thx! I did a few more experiments, also with SMT disabled. It looks like an architectural effect: the AMD CPU contains two chiplets, and as soon as both are involved in the computation, memory access performance drops significantly. Since the initial memory allocation is one single block made from thread 0, this NUMA effect makes sense.

bernhardmgruber added 5 commits December 9, 2020 18:34

run LLAMA demo with all mappings and add AoSoA and SplitMapping examples

bc1dee2

use accumulator variables where meaningful and add to plots

cb55ef7

disable AoSoA tiled version

b00a1c7

add AoSoA LANES to diagram series names and increase clarity of used variables

0d13555

add hostname to chart title

82b4238

bernhardmgruber added 4 commits December 10, 2020 14:38

unify manual AoSoA and LLAMA LANES

faed61f

simplify loop induction variable declaration

fffa65d

run AoSoA examples with multiple lanes

8b93372

work around MSVC bug

91ba11e

switch to OpenMP for parallelization and test multiple thread numbers

992ade4

add hint for thread pinning

cbe3814

bernhardmgruber merged commit 18bcdfb into alpaka-group:develop Dec 11, 2020

bernhardmgruber deleted the nbody branch December 11, 2020 15:32
