CPU nbody improvements and more benchmarks #133
Thanks for providing the plots together with the PR. The current results say we need to use Vc, and I would like to have it integrated into PIConGPU ❤️
What does |
Indeed, I found that the best versions use either Vc or vector intrinsics; both generate close to optimal assembly. The caveat with Vc, though, is that it is barely maintained at the moment, because all effort is focused on std::simd. We will see how that goes. But it is definitely worth using a SIMD library!
All benchmarks are indeed single-threaded, because I focused on pure single-thread performance. The last 2 benchmarks use the parallel STL, but only for comparison and as proof that good vector assembly also works well in a multithreaded setting. However, libstdc++'s parallel STL implementation behaves oddly; it worked far better on Windows ;) That's why there is also an OpenMP version (commented out), which works better on Linux. I also suspect that the parallel STL does not parallelize when the workload is smallish (last benchmark on the AMD system). In the end, I should have a separate benchmark comparing the single-threaded and parallel Vc versions with a bigger workload.
If you are on Linux you can check with |
Thx! I did a few more experiments, also with SMT disabled. It looks like an architectural effect: the AMD CPU contains two chiplets, and as soon as both are involved in the computation, memory access performance drops significantly. Since the initial memory allocation is a single block made from thread 0, this NUMA effect makes sense.
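For illustration only, here is a minimal sketch of the usual "first touch" countermeasure, which is not part of this PR. It assumes Linux's default policy of placing a page on the node of the thread that first writes to it, and it only helps if the system actually exposes the chiplets as separate NUMA nodes. The function name `allocateFirstTouch` is hypothetical. Without `-fopenmp` the pragma is ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper: allocate n floats without touching them, then let each
// OpenMP thread write (and thereby place) the chunk it will work on later.
std::unique_ptr<float[]> allocateFirstTouch(std::size_t n, float value) {
    // new float[n] default-initializes, so the elements are not written here.
    // For large blocks this typically means no page has been touched yet.
    std::unique_ptr<float[]> data(new float[n]);

    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
    // A static schedule gives every thread one contiguous chunk. The compute
    // loops would need the same schedule (and pinned threads) to benefit.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        data[i] = value;

    return data;
}
```

By contrast, constructing a `std::vector<float>(n)` value-initializes every element on the constructing thread, which matches the "single block from thread 0" situation described above.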
After a lot of work on the CUDA nbody, I added some of the optimizations here as well. This mainly concerns using accumulator variables.
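As a rough sketch of what "using accumulator variables" can look like, the code below contrasts an inner loop that updates the velocity in memory on every iteration with one that sums into local variables and writes back once. This is not the code from this PR: the `Particles` layout, the constants, and both function names are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays particle storage.
struct Particles {
    std::vector<float> x, y, z;    // positions
    std::vector<float> vx, vy, vz; // velocities
    std::vector<float> mass;
};

constexpr float kTimestep = 0.0001f; // assumed value
constexpr float kEps2 = 0.01f;       // softening term, avoids division by zero

// Scale factor for the distance vector (dx, dy, dz) towards a particle of mass massJ.
inline float interaction(float dx, float dy, float dz, float massJ) {
    const float distSqr = dx * dx + dy * dy + dz * dz + kEps2;
    const float invDist = 1.0f / std::sqrt(distSqr);
    return massJ * invDist * invDist * invDist * kTimestep;
}

// Baseline: the velocity of particle i is updated in memory in every inner iteration.
// The compiler often cannot keep it in a register, because it cannot prove that
// the store to vx[i] does not alias the loads from the other arrays.
void updateInPlace(Particles& p) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            const float dx = p.x[j] - p.x[i];
            const float dy = p.y[j] - p.y[i];
            const float dz = p.z[j] - p.z[i];
            const float s = interaction(dx, dy, dz, p.mass[j]);
            p.vx[i] += dx * s;
            p.vy[i] += dy * s;
            p.vz[i] += dz * s;
        }
}

// Accumulator version: sum into locals and write back once per particle i.
void updateWithAccumulators(Particles& p) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        const float xi = p.x[i], yi = p.y[i], zi = p.z[i];
        float ax = 0.0f, ay = 0.0f, az = 0.0f; // accumulators
        for (std::size_t j = 0; j < n; ++j) {
            const float dx = p.x[j] - xi;
            const float dy = p.y[j] - yi;
            const float dz = p.z[j] - zi;
            const float s = interaction(dx, dy, dz, p.mass[j]);
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        p.vx[i] += ax;
        p.vy[i] += ay;
        p.vz[i] += az;
    }
}
```

Both functions compute the same update up to floating-point rounding; whether the accumulator form is actually faster depends on the compiler and data layout, so it is worth checking the generated assembly.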
Also, more versions are run and plotted.