CPU nbody improvements and more benchmarks #133

Merged
merged 11 commits into from
Dec 11, 2020

Conversation

bernhardmgruber
Member

After a lot of work on the CUDA nbody, I added some of the optimizations here as well. This mainly concerns using accumulator variables.
Also, more versions are run and plotted.
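
To illustrate the accumulator idea, here is a minimal sketch, not the code from this PR. The `Particles` layout, the function name, and the constants `timestep` and `eps2` are assumptions made for illustration. The inner loop only reads particle `j` and adds into three local variables, so particle `i` is written once per outer iteration instead of once per interaction.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays particle storage (illustrative names only).
struct Particles {
    std::vector<float> x, y, z;    // positions
    std::vector<float> vx, vy, vz; // velocities
    std::vector<float> mass;
};

constexpr float timestep = 0.0001f; // assumed value
constexpr float eps2 = 0.01f;       // assumed softening term, avoids division by zero

// Velocity update using accumulator variables.
void updateVelocities(Particles& p) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Load particle i once; it stays in registers during the inner loop.
        const float xi = p.x[i], yi = p.y[i], zi = p.z[i];
        float ax = 0.0f, ay = 0.0f, az = 0.0f; // accumulators
        for (std::size_t j = 0; j < n; ++j) {
            const float dx = p.x[j] - xi;
            const float dy = p.y[j] - yi;
            const float dz = p.z[j] - zi;
            const float distSqr = dx * dx + dy * dy + dz * dz + eps2;
            const float invDist = 1.0f / std::sqrt(distSqr);
            const float s = p.mass[j] * invDist * invDist * invDist * timestep;
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        // Single write-back per particle.
        p.vx[i] += ax;
        p.vy[i] += ay;
        p.vz[i] += az;
    }
}
```

Keeping the sums in locals means the compiler does not have to assume that a store to `vx[i]` might alias the arrays read in the inner loop, which tends to make vectorization easier.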

@bernhardmgruber
Member Author

bernhardmgruber commented Dec 10, 2020

Here is a benchmark on an AMD Ryzen 9 5950X:
image

And on an Intel(R) Core(TM) i7-7820X:
image

@psychocoderHPC
Member

psychocoderHPC commented Dec 10, 2020

Here is a benchmark on an AMD Ryzen 9 5950X:
...

Thanks for providing the plots together with the PR. The current results say that we need to use Vc, and I would like to have it integrated into PIConGPU ❤️

@psychocoderHPC
Member

psychocoderHPC commented Dec 10, 2020

What does *parallel mean?
Are all benchmarks running on a single core except those with parallel?
If so, how do you parallelize?

@bernhardmgruber
Member Author

Here is a benchmark on an AMD Ryzen 9 5950X:
...

Thanks for providing the plots together with the PR. The current results say that we need to use Vc, and I would like to have it integrated into PIConGPU ❤️

Indeed, I found that the best versions use either Vc or vector intrinsics. They generate close to optimal assembly. The caveat with Vc, though, is that it is barely maintained at the moment because all effort is focused on std::simd. We will see how that goes. But it is definitely worth using a SIMD library!

What does *parallel mean?
Are all benchmarks running on a single core except those with parallel?
If so, how do you parallelize?

All benchmarks are indeed single-threaded, because I focused on pure single-thread performance. The last 2 benchmarks use the parallel STL, but just for comparison and as proof that good vector assembly will then also work well in a multithreaded fashion. However, libstdc++'s parallel STL implementation behaves oddly; it worked far better on Windows ;) That's why there is also an OpenMP version (commented out), which works better on Linux. I also suspect that the parallel STL does not parallelize if the workload is smallish (last benchmark on the AMD system). In the end, I should have a separate benchmark for the single-threaded and parallel Vc versions with a bigger workload.
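
The commented-out OpenMP version is not shown in this conversation. As a rough idea of what such a parallelization could look like, here is a sketch under assumptions: the `Particles` struct, the function name, and the parameters are hypothetical. The outer loop over particles is distributed across threads, and each iteration writes only to its own particle, so no synchronization is required.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays storage; not the PR's actual data layout.
struct Particles {
    std::vector<float> x, y, z, vx, vy, vz, mass;
};

// Outer loop distributed over threads. Iteration i writes only vx[i], vy[i],
// vz[i] and otherwise just reads positions and masses.
void updateVelocitiesOpenMP(Particles& p, float timestep, float eps2) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(p.x.size());
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f; // per-iteration accumulators
        for (std::ptrdiff_t j = 0; j < n; ++j) {
            const float dx = p.x[j] - p.x[i];
            const float dy = p.y[j] - p.y[i];
            const float dz = p.z[j] - p.z[i];
            const float distSqr = dx * dx + dy * dy + dz * dz + eps2;
            const float invDist = 1.0f / std::sqrt(distSqr);
            const float s = p.mass[j] * invDist * invDist * invDist * timestep;
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        p.vx[i] += ax;
        p.vy[i] += ay;
        p.vz[i] += az;
    }
}
```

With GCC or Clang this needs `-fopenmp` to actually run in parallel; without that flag the pragma is ignored and the loop runs serially with the same result.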

@bernhardmgruber
Member Author

On the 16-core AMD system, the fastest version with different numbers of threads:
image

I ran with GOMP_CPU_AFFINITY='0-30:2,1-31:2': strong scaling in the beginning. I have the feeling the memory subsystem saturates at some point, and I have no clue what happens at 32 threads (hyperthreading). But I will stop investigating here; I want to focus mostly on single-thread performance for now.

@psychocoderHPC
Member

I ran with GOMP_CPU_AFFINITY='0-30:2,1-31:2': strong scaling in the beginning. I have the feeling the memory subsystem saturates at some point, and I have no clue what happens at 32 threads (hyperthreading). But I will stop investigating here; I want to focus mostly on single-thread performance for now.

If you are on Linux, you can check with hwloc-ls how the OS is distributing the threads across the hardware resources.

@bernhardmgruber
Member Author

Thx! I did a few more experiments, also with SMT disabled. It looks like an architectural effect: the AMD CPU contains two chiplets, and as soon as both are involved in the computation, the memory access performance drops significantly. Since the initial memory allocation is a single block from thread 0, this NUMA effect makes sense.
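
If the drop really is a memory placement effect as suspected above, one common mitigation on systems with a first-touch page placement policy (the Linux default) is to let the worker threads perform the first write to the data instead of thread 0. The following is only a sketch of that technique, untested on the system discussed here; the function name and parameters are hypothetical.

```cpp
#include <cstddef>
#include <memory>

// Allocates n floats without writing to them, then initializes them in a
// parallel loop. Under a first-touch policy, each page is placed close to the
// thread that writes it first.
std::unique_ptr<float[]> allocateFirstTouch(std::size_t n, float value) {
    // new float[n] default-initializes, i.e. the memory is not touched yet.
    std::unique_ptr<float[]> data(new float[n]);
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
    // A static schedule assigns fixed contiguous index ranges to threads.
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        data[i] = value;
    return data;
}
```

For this to pay off, the compute loops would need to use the same static schedule and thread pinning, so that each thread mostly works on the index range it initialized. Whether it helps on a given CPU depends on how that CPU exposes its memory topology, which this sketch does not establish.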

@bernhardmgruber bernhardmgruber merged commit 18bcdfb into alpaka-group:develop Dec 11, 2020
@bernhardmgruber bernhardmgruber deleted the nbody branch December 11, 2020 15:32
2 participants