
Restructure buffer packing kernels #938

Merged: 7 commits into develop, Oct 6, 2023

Conversation

@lroberts36 (Collaborator) commented Sep 7, 2023

PR Summary

This PR adds a third level of hierarchical parallelism to the buffer packing/unpacking kernels in SendBoundBufs and SetBounds. The new innermost level is a ThreadVectorRange over the i index we are packing, which is contiguous in memory. In the new innermost loop, we access the variable and buffer via raw pointers, which presumably enables vectorization, reduces the number of required index calculations, and (maybe?) helps with caching.

This change gives a 30% speedup in 128^3 simulations in Riot with 32^3 blocks on CPU. There is a slight degradation in performance on GPUs though (5-6% for 32^3 and 64^3 blocks, based on @pdmullen's timings).
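
Schematically, the restructured loop looks like the following minimal, self-contained sketch (illustrative sizes and names such as pack_demo, var, and buf; the real kernels index through bnd_info as shown in the diff):

```cpp
#include <Kokkos_Core.hpp>

// Sketch of the three-level hierarchy, not the actual Parthenon code:
// league over buffers, TeamThreadRange over flattened (k, j) slabs, and the
// new innermost ThreadVectorRange over the contiguous i index through raw
// pointers, giving the compiler a chance to vectorize the copy.
void pack_demo() {
  const int nbuf = 4, Nk = 8, Nj = 8, Ni = 32;
  Kokkos::View<double ****, Kokkos::LayoutRight> var("var", nbuf, Nk, Nj, Ni);
  Kokkos::View<double **, Kokkos::LayoutRight> buf("buf", nbuf, Nk * Nj * Ni);

  using team_mbr_t = Kokkos::TeamPolicy<>::member_type;
  Kokkos::parallel_for(
      "pack", Kokkos::TeamPolicy<>(nbuf, Kokkos::AUTO),
      KOKKOS_LAMBDA(team_mbr_t team_member) {
        const int b = team_member.league_rank();
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team_member, Nk * Nj), [&](const int idx) {
              const int k = idx / Nj, j = idx % Nj;
              // One index calculation per contiguous i-run ...
              const double *v = &var(b, k, j, 0);
              double *p = &buf(b, idx * Ni);
              // ... then a raw-pointer copy over i, contiguous in memory.
              Kokkos::parallel_for(
                  Kokkos::ThreadVectorRange(team_member, Ni),
                  [&](const int i) { p[i] = v[i]; });
            });
      });
}
```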

PR Checklist

  • Code passes cpplint
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@lroberts36 lroberts36 requested review from Yurlungur, pgrete, jdolence and pdmullen and removed request for Yurlungur September 7, 2023 23:21
@Yurlungur (Collaborator) left a comment

I'd like to hear other people's thoughts on the trade-off between CPU and GPU performance, especially from @pgrete, but to my mind this is a win.

@lroberts36 (Collaborator, Author):

@Yurlungur: Agreed about hearing people's opinions; that was part of the reason for putting up the PR. I know @pgrete looked at how the parallelism was structured in these kernels in the past, and I think the way it was written previously gave an approximately optimal balance between performance on different architectures. I am not sure that still holds, though, after the number of changes made in the interim.

@pgrete (Collaborator) commented Sep 18, 2023

I'm happy to revisit the kernel structure wrt performance.
IIRC the current setup was the one that favored GPUs versus CPUs, so I'm not surprised it gives a boost for CPUs.
I'm currently looking at a performance comparison between AthenaPK and AthenaK (or to be more specific @pdmullen is) with the goal to gauge AMR/remeshing performance.
In that context I'd also like to test this change (on both AMD and Nvidia GPUs).
The numbers quoted above: are those for the integrated performance or just for the specific kernels?

@pdmullen (Collaborator):

> I'm happy to revisit the kernel structure wrt performance. IIRC the current setup was the one that favored GPUs versus CPUs, so I'm not surprised it gives a boost for CPUs. I'm currently looking at a performance comparison between AthenaPK and AthenaK (or to be more specific @pdmullen is) with the goal to gauge AMR/remeshing performance. In that context I'd also like to test this change (on both AMD and Nvidia GPUs). The numbers quoted above: are those for the integrated performance or just for the specific kernels?

Integrated performance.

@pgrete (Collaborator) commented Sep 18, 2023

> > I'm happy to revisit the kernel structure wrt performance. IIRC the current setup was the one that favored GPUs versus CPUs, so I'm not surprised it gives a boost for CPUs. I'm currently looking at a performance comparison between AthenaPK and AthenaK (or to be more specific @pdmullen is) with the goal to gauge AMR/remeshing performance. In that context I'd also like to test this change (on both AMD and Nvidia GPUs). The numbers quoted above: are those for the integrated performance or just for the specific kernels?
>
> Integrated performance.

Wow 😲
In that case I think this deserves a closer look before pulling the trigger (i.e., also looking at performance numbers of KHARMA and AthenaPK for a few selected typical production setups in terms of block sizes and counts).

@lroberts36 (Collaborator, Author):

Yeah, I am a little worried that we will find that we just need to write two separate kernels and switch between them based on compile-time options.
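
(Purely for illustration, such a compile-time switch might look like the sketch below; PackCpuVariant and PackGpuVariant are hypothetical placeholders, not existing Parthenon functions.)

```cpp
#include <Kokkos_Core.hpp>
#include <type_traits>

// Hypothetical sketch: pick a packing kernel variant at compile time based
// on where the default execution space's memory lives. Both variants are
// placeholder stubs here.
void PackCpuVariant() { /* vector-friendly innermost i loop */ }
void PackGpuVariant() { /* flat index space, finer-grained parallelism */ }

void Pack() {
  using mem_t = Kokkos::DefaultExecutionSpace::memory_space;
  if constexpr (std::is_same_v<mem_t, Kokkos::HostSpace>) {
    PackCpuVariant();  // host backends (Serial, OpenMP, ...)
  } else {
    PackGpuVariant();  // device backends (CUDA, HIP, ...)
  }
}
```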

@bprather (Collaborator):

So, testing on Chicoma CPU, using 64 1-thread processes on 2D Orszag-Tang and 3D SANE torus problems with 32x32(x32)-zone meshblocks, I find KHARMA is ~4M ZCPS with the new buffer packing, whereas it's about 4.5M ZCPS with the old buffer packing -- everything else completely identical. This is with the Cray compilers wrapping AOCC, using Kokkos "aggressive vectorization" but otherwise not fiddling with compile flags myself.

Performance is fun!

@pdmullen (Collaborator) commented Sep 21, 2023

> So, testing on Chicoma CPU, using 64 1-thread processes on 2D Orszag-Tang and 3D SANE torus problems with 32x32(x32)-zone meshblocks, I find KHARMA is ~4M ZCPS with the new buffer packing, whereas it's about 4.5M ZCPS with the old buffer packing -- everything else completely identical. This is with the Cray compilers wrapping AOCC, using Kokkos "aggressive vectorization" but otherwise not fiddling with compile flags myself.
>
> Performance is fun!

If relevant, the numbers @lroberts36 reported used Intel 20.2.2 with flags `-mavx2 -qopt-report=1 -qopt-report-phase=vec,loop -fno-math-errno -march=broadwell`.

@bprather (Collaborator):

I'll play with compilers a bit more -- certainly, neither of those numbers reflects what KHARMA can really pull on the hardware, so it's not a realistic enough test case to be damning. Still, I'd bet that for many non-vectorized cases this is going to drag down performance.

I'm surprised the ICC compile line was that simple -- from the KNL days, I know there are some spicy options for classic ICC that can convince it to (attempt to) vectorize just about anything, so maybe there's a way to vectorize the existing code for Riot's case, but still keep good performance for noobs using LLVM-based compilers.

From iharm3d's code it looks like I added at least `-funroll-loops -ipo -qopt-prefetch=5`, and also used either the `-x` option or `-march=native -mtune=native` together. I think prefetch level 5 is KNL-only, but the option in general might be worth looking at.

@bprather (Collaborator) commented Sep 25, 2023

When compiling with Intel classic, KHARMA sees about a 10% improvement with the new buffers on a Skylake system (2.2M to 2.45M). Interestingly it also improves under IntelLLVM (2.25M to 2.35M), and dramatically with AOCC (2.0M to 2.4M) on the same node. This is using one process split with OpenMP, but I tested and the trend holds with 1 proc/core at least with AOCC.

I'll look back at what might have happened on Chicoma, but I think it's safe to call it an outlier FWIW. I can also provide GPU numbers from e.g. Frontier if this would be useful. Nvidia machines are hard due to the number of clean builds required and #922.

EDIT: haha, did I say AOCC improved? I meant that it now crashes when compiling the SetBuffer code. It recommends a bug report to AMD with the location/backtrace below:

```
1.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:256:35: current parser token ';'
2.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:42:1: parsing namespace 'parthenon'
3.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:225:59: parsing function body 'parthenon::SetBounds'
4.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:225:59: in compound statement ('{}')
5.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:241:7 <Spelling=/vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/external/Kokkos/core/src/Kokkos_Macros.hpp:119:23>: lambda expression parsing
6.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:241:56: in compound statement ('{}')
7.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:244:75: in compound statement ('{}')
8.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:247:67: in compound statement ('{}')
9.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:250:17: lambda expression parsing
11. /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:250:36: in compound statement ('{}')
```

I didn't notice and had recorded the 2.4M number from an Intel LLVM binary, so also: take these measurements with a +/- 0.05M ZCPS grain of salt.

EDIT EDIT: this seems to be fixed in AOCC 3.2, since the code compiles fine on Chicoma. It's just that I can't provide any AOCC numbers except over there.

@bprather (Collaborator) commented Sep 25, 2023

Re-testing the most recent KHARMA code on Chicoma carefully, there's a slight speedup (5%) with the new buffers on CPUs -- this is still with AOCC 3.2, but a newer version of KHARMA and under more realistic problems/conditions. Overall, this brings KHARMA into line with a milder version of what Riot is seeing: uniformly higher CPU performance, though not by amounts as large as Riot's.

However, when running on Nvidia GPUs with one block per rank (Milan/A100), we actually seem to get better performance with the new kernels, on the order of 15%. So, pulling this one. (This is with NVCC/NVC++, other Chicoma defaults).

EDIT: changed conclusions due to my shoddy testing practices. To be fair, this was the first time I got KHARMA working with the new device-side buffers, so I had no idea what to expect.

@pdmullen (Collaborator):

So I see two data points that favor the new buffer packing design... anybody willing to test how AthenaPK fares?

@pgrete (Collaborator) commented Sep 26, 2023

Here are the AthenaPK numbers (I just tested uniform grids on a single node with different block sizes to focus on buffer packing).

Intel (2× Intel Xeon Platinum 8168 CPU, 2× 24 cores, 2.7 GHz, on JUWELS)

  • Compiler: Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)
  • Uniform mesh 256x256x192 and block sizes of 64^3, 32^3, and 16^3

With current develop (in 10^8 zcs)

| Block size | 16 | 32 | 64 |
| --- | --- | --- | --- |
| pack_size = -1 | 0.117 | 0.196 | 0.276 |
| pack_size = 1 | 0.19 | 0.236 | 0.277 |

With this branch:

| Block size | 16 | 32 | 64 |
| --- | --- | --- | --- |
| pack_size = -1 | 0.158 | 0.248 | 0.32 |
| pack_size = 1 | 0.259 | 0.291 | 0.324 |

Relative improvement with this branch:

| Block size | 16 | 32 | 64 |
| --- | --- | --- | --- |
| pack_size = -1 | 1.35 | 1.27 | 1.16 |
| pack_size = 1 | 1.36 | 1.23 | 1.17 |

PS: An interesting observation I made is that the new Intel compiler actually produces faster code than the legacy one.
Last time I checked this was not the case.

GPU (4x Nvidia A100 on JUWELS Booster)

  • Compiler: cuda_11.7.r11.7/compiler.31294372_0
  • Uniform mesh 512x256x256 and block sizes of 32^3, 64^3 and 128^3

With current develop (in 10^8 zcs)

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 2.57 | 5.99 | 6 |
| pack_size = 1 | 0.584 | 1.61 | 3.34 |

With this branch:

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 2.48 | 5.91 | 6.39 |
| pack_size = 1 | 0.628 | 1.82 | 3.8 |

Relative improvement with this branch:

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 0.96 | 0.99 | 1.06 |
| pack_size = 1 | 1.08 | 1.13 | 1.14 |

Bottom line

AthenaPK also sees up to a ~30% aggregate performance improvement at small meshblocks with Intel on CPU; on A100, it is only for very small meshblocks that the change does not yield an improvement.
I'm still surprised by this and am not sure yet what the bigger takeaway here is (the smaller one is that the change is also a win for us and we should go forward with it).

@bprather (Collaborator):

> 1.14

This is what I was seeing on A100s for big blocks too (128x64x64, I think). This is a non-trivial boost, at least for us. Where's the speedup coming from, do we think? Do we have any explanation? Should I start pulling pointers in KHARMA so I can perform better on GPUs?

@pgrete (Collaborator) commented Sep 27, 2023

> > 1.14
>
> This is what I was seeing on A100s for big blocks too (128x64x64, I think). This is a non-trivial boost, at least for us. Where's the speedup coming from, do we think? Do we have any explanation? Should I start pulling pointers in KHARMA so I can perform better on GPUs?

The answer is probably more complex.
I did some more testing (also against older versions, e.g., an AthenaPK one from about 14 months ago) on the 4x A100 machine.

The following are kernel/region timers (using the space-time profiling tool) comparing the old version (a-m) versus current Parthenon develop (dev) versus this branch (buf) for 128^3 blocks
[image: kernel/region timers, 128^3 blocks]
and 64^3 blocks
[image: kernel/region timers, 64^3 blocks]
and 32^3 blocks
[image: kernel/region timers, 32^3 blocks]

So at 128^3 blocks current dev is slower for the SetBoundaries kernel than both old and this branch (with old being the fastest), whereas for 32^3 to 64^3 it seems to be the other way round (current dev is the fastest).
I'm not sure what to make of those numbers yet, and I also noticed quite a bit of per-cycle variability on the node I used for testing.

One takeaway I already have is that I should bump up performance regression testing on my todo list...

@lroberts36 (Collaborator, Author):

@pgrete: What is the hash of the old version of Parthenon you were looking at? Did it have sparse-related things turned on?

@pdmullen (Collaborator):

@pgrete, we'd like to get this in for our downstream codes --- it seems that everybody benefits.

Perhaps we don't completely understand why yet, but should that be a prerequisite for getting this merged?

@pdmullen (Collaborator) left a comment

LGTM

@pgrete (Collaborator) commented Sep 29, 2023

> @pgrete: What is the hash of the old version of Parthenon you were looking at? Did it have sparse-related things turned on?

This was with Parthenon version 5b6bb61906f7c278f9724ee9f38e79dee8707098

Sparse is still disabled at compile time (PARTHENON_DISABLE_SPARSE) in AthenaPK by default.

@pgrete (Collaborator) commented Sep 29, 2023

> @pgrete, we'd like to get this in for our downstream codes --- it seems that everybody benefits.
>
> Perhaps we don't completely understand why yet, but should that be a prerequisite for getting this merged?

Sorry if I gave that impression.
Given that this seems to be a broad win, I'm definitely in favor of including the fix (even if, as you say, we don't fully understand it yet).


```cpp
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
                     [&](int m) { buf[m] = var[m]; });
```

Collaborator:

Do we need a team barrier here (as we use buf[m] in the following par_reduce)?

@lroberts36 (Collaborator, Author):

I thought that we didn't because each element of buf was only being accessed by the same thread that set it. Maybe I am misunderstanding the Kokkos model though. If we have to add a team barrier I think we should just combine the parallel_reduce with the parallel_for.
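
(For concreteness, if a barrier did turn out to be necessary, the sketch below shows where it would go; buf, var, and Ni are as in the snippet above, and the barrier line is hypothetical.)

```cpp
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
                     [&](int m) { buf[m] = var[m]; });
// Hypothetical: only needed if other threads of the team read buf[m]
// before the next nested policy.
team_member.team_barrier();
// ... the following parallel_reduce over buf would go here ...
```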

```cpp
Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);

Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
```
Collaborator:

Out of curiosity, have you tried a TeamVectorRange and/or the new TeamMDRange?

@lroberts36 (Collaborator, Author):

I tried TeamVectorRange (as a single level of parallelism), but that seemed to be slower at least on CPU. I have not experimented with TeamMDRange, I think it would be interesting to try this out.

As you pointed out, it is not clear what the origin of the performance boost from these changes is. My original thought was that it would promote vectorization on CPU along the i-direction, which would likely be beneficial when the i-direction extent is not just nghost. That being said, it is also possible that some of the benefit comes from the fact that we perform the indexing calculations significantly fewer times with the current setup.

@jdolence (Collaborator) commented Oct 2, 2023

Is there any documentation for TeamMDRange? Can't seem to find any. What does it do? @lroberts36 are you going to try this, or should we just merge what we have and try this out later?

@lroberts36 (Collaborator, Author):

@jdolence: I believe TeamMDRange should give us a Kokkos native way to have an inner loop go over multiple indices (rather than having the loop go over a single index and then calculate the multi-dimensional indices by hand as we do now). I don't necessarily see why TeamMDRange would give a performance boost, but that doesn't mean we shouldn't test it at some point. I would leave that for a future PR though.
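
(For reference, a hedged sketch of what the inner loops might look like with the TeamThreadMDRange policy that newer Kokkos releases provide; names and sizes are illustrative, not Parthenon code.)

```cpp
#include <Kokkos_Core.hpp>

// Sketch: TeamThreadMDRange walks multiple indices natively, instead of
// flattening to a single index and decomposing it by hand.
void pack_md_demo() {
  const int nbuf = 4, Nk = 8, Nj = 8, Ni = 32;
  Kokkos::View<double ****, Kokkos::LayoutRight> var("var", nbuf, Nk, Nj, Ni);
  Kokkos::View<double **, Kokkos::LayoutRight> buf("buf", nbuf, Nk * Nj * Ni);

  using team_mbr_t = Kokkos::TeamPolicy<>::member_type;
  Kokkos::parallel_for(
      "pack_md", Kokkos::TeamPolicy<>(nbuf, Kokkos::AUTO),
      KOKKOS_LAMBDA(team_mbr_t team_member) {
        const int b = team_member.league_rank();
        Kokkos::parallel_for(
            Kokkos::TeamThreadMDRange<Kokkos::Rank<2>, team_mbr_t>(
                team_member, Nk, Nj),
            [&](const int k, const int j) {
              // Vector loop over the contiguous i index, as in this PR.
              Kokkos::parallel_for(
                  Kokkos::ThreadVectorRange(team_member, Ni),
                  [&](const int i) {
                    buf(b, (k * Nj + j) * Ni + i) = var(b, k, j, i);
                  });
            });
      });
}
```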

Comment on lines +119 to +121
```cpp
Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));

lnon_zero = lnon_zero || mnon_zero;
```
Collaborator:

Do we need a barrier here? I haven't looked into the blocking/nonblocking nature of inner par reduces to device memory (though they'd always be to device memory), so it might be a semantic question.

@lroberts36 (Collaborator, Author):

I thought we didn't because I thought the inner parallel_reduce served as a barrier itself (i.e. I thought mnon_zero was set on return). But honestly I couldn't find much documentation about how this was supposed to behave.

Comment on lines +114 to +115
```cpp
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange<>(team_member, Ni),
```
Collaborator:

Why is this reduce split from the above for? They go over the same range and only use local information, so it could all be in one reduce, couldn't it?

@lroberts36 (Collaborator, Author):

It could all be one reduce. I think I had split it up to see what the impact of removing the reduction was (since it is unnecessary for dense variables). This may be related to why you see slightly worse performance with newer versions of Parthenon compared to versions before my changes to the sparse-related infrastructure. There were a number of things in the boundary packing related to sparse that used to be compiled out via macros but are not now (we could go back to compiling them out if that is the source of your performance difference).
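
(A sketch of the fused version, for illustration: the reducer mirrors the diff above, while `threshold` is an assumed stand-in for the actual sparse-allocation cutoff.)

```cpp
// Hypothetical fusion of the copy and the sparse non-zero check into a
// single inner reduction; `threshold` is an assumed name, not the actual
// Parthenon variable.
bool mnon_zero = false;
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange<>(team_member, Ni),
    [&](const int m, bool &lnz) {
      buf[m] = var[m];
      lnz = lnz || (std::abs(buf[m]) >= threshold);
    },
    Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));
lnon_zero = lnon_zero || mnon_zero;
```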

```cpp
bnd_info(b).buf(idx + idx_offset);
});
Kokkos::parallel_for(
    Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
```
Collaborator:

As above have you tried a TeamVectorRange and/or the new TeamMDRange?

@lroberts36 (Collaborator, Author):

As above, I tried TeamVectorRange but not TeamMDRange.

@pgrete (Collaborator) commented Oct 6, 2023

I just checked the "split" version (branch roberts36/speedup-buffer-kernel-split) that @lroberts36 suggested yesterday during the call.
The relative difference to the changes in this PR is as follows (on 4x A100):

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 1.00 | 1.00 | 1.01 |
| pack_size = 1 | 1.02 | 1.02 | 1.02 |

Everything is within the noise, so it doesn't make a difference.

@jdolence (Collaborator) commented Oct 6, 2023

@pgrete with your approval I assume we're good to merge, right?

@jdolence (Collaborator) commented Oct 6, 2023

Based on approvals, going ahead with enabling auto-merge.

@jdolence jdolence enabled auto-merge October 6, 2023 15:29
@jdolence jdolence merged commit 8b49d77 into develop Oct 6, 2023