
Restructure buffer packing kernels #938

Merged: 7 commits into develop, Oct 6, 2023

Conversation

@lroberts36 (Collaborator) commented Sep 7, 2023

PR Summary

This PR adds a third level of hierarchical parallelism to the buffer packing/unpacking kernels in SendBoundBufs and SetBounds. The new innermost level is a ThreadVectorRange over the i index we are packing, which is contiguous in memory. In the new innermost loop, we access the variable and buffer via raw pointers, which presumably enables vectorization, reduces the number of required index calculations, and (maybe?) helps with caching.

This change gives a 30% speedup in 128^3 simulations in Riot with 32^3 blocks on CPU. There is a slight degradation in performance on GPUs though (5-6% for 32^3 and 64^3 blocks, based on @pdmullen's timings).
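
Schematically, the restructured loop looks like the following minimal, self-contained sketch (illustrative sizes and names such as pack_demo, var, and buf; the real kernels index through bnd_info as shown in the diff):

```cpp
#include <Kokkos_Core.hpp>

// Sketch of the three-level hierarchy, not the actual Parthenon code:
// league over buffers, TeamThreadRange over flattened (k, j) slabs, and the
// new innermost ThreadVectorRange over the contiguous i index through raw
// pointers, giving the compiler a chance to vectorize the copy.
void pack_demo() {
  const int nbuf = 4, Nk = 8, Nj = 8, Ni = 32;
  Kokkos::View<double ****, Kokkos::LayoutRight> var("var", nbuf, Nk, Nj, Ni);
  Kokkos::View<double **, Kokkos::LayoutRight> buf("buf", nbuf, Nk * Nj * Ni);

  using team_mbr_t = Kokkos::TeamPolicy<>::member_type;
  Kokkos::parallel_for(
      "pack", Kokkos::TeamPolicy<>(nbuf, Kokkos::AUTO),
      KOKKOS_LAMBDA(team_mbr_t team_member) {
        const int b = team_member.league_rank();
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team_member, Nk * Nj), [&](const int idx) {
              const int k = idx / Nj, j = idx % Nj;
              // One index calculation per contiguous i-run ...
              const double *v = &var(b, k, j, 0);
              double *p = &buf(b, idx * Ni);
              // ... then a raw-pointer copy over i, contiguous in memory.
              Kokkos::parallel_for(
                  Kokkos::ThreadVectorRange(team_member, Ni),
                  [&](const int i) { p[i] = v[i]; });
            });
      });
}
```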

PR Checklist

  • Code passes cpplint
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@lroberts36 lroberts36 requested review from Yurlungur, pgrete, jdolence and pdmullen and removed request for Yurlungur September 7, 2023 23:21
@Yurlungur (Collaborator) left a comment

I'd like to hear other people's thoughts on the trade-off between CPU and GPU performance, especially from @pgrete, but to my mind this is a win.

@lroberts36 (Collaborator, Author):

@Yurlungur: Agreed about hearing people's opinions; that was part of the reason for putting up the PR. I know @pgrete looked at how the parallelism was structured in these kernels in the past, and I think the way it was written previously gave an approximately optimal balance between performance on different architectures. I am not sure that still holds, though, after the number of changes made in the interim.

@pgrete (Collaborator) commented Sep 18, 2023

I'm happy to revisit the kernel structure wrt performance.
IIRC the current setup was the one that favored GPUs versus CPUs, so I'm not surprised it gives a boost for CPUs.
I'm currently looking at a performance comparison between AthenaPK and AthenaK (or to be more specific @pdmullen is) with the goal to gauge AMR/remeshing performance.
In that context I'd also like to test this change (on both AMD and Nvidia GPUs).
The numbers quoted above: are those for the integrated performance or just for the specific kernels?

@pdmullen (Collaborator):

> I'm happy to revisit the kernel structure wrt performance. IIRC the current setup was the one that favored GPUs versus CPUs, so I'm not surprised it gives a boost for CPUs. I'm currently looking at a performance comparison between AthenaPK and AthenaK (or to be more specific @pdmullen is) with the goal to gauge AMR/remeshing performance. In that context I'd also like to test this change (on both AMD and Nvidia GPUs). The numbers quoted above: are those for the integrated performance or just for the specific kernels?

Integrated performance.

@pgrete (Collaborator) commented Sep 18, 2023

> > I'm happy to revisit the kernel structure wrt performance. IIRC the current setup was the one that favored GPUs versus CPUs, so I'm not surprised it gives a boost for CPUs. I'm currently looking at a performance comparison between AthenaPK and AthenaK (or to be more specific @pdmullen is) with the goal to gauge AMR/remeshing performance. In that context I'd also like to test this change (on both AMD and Nvidia GPUs). The numbers quoted above: are those for the integrated performance or just for the specific kernels?
>
> Integrated performance.

Wow 😲
In that case I think this deserves a closer look before pulling the trigger (i.e., also looking at performance numbers of KHARMA and AthenaPK for a few selected typical production setups in terms of block sizes and counts).

@lroberts36 (Collaborator, Author):

Yeah, I am a little worried that we will find that we just need to write two separate kernels and switch between them based on compile-time options.
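
(Purely for illustration, such a compile-time switch might look like the sketch below; PackCpuVariant and PackGpuVariant are hypothetical placeholders, not existing Parthenon functions.)

```cpp
#include <Kokkos_Core.hpp>
#include <type_traits>

// Hypothetical sketch: pick a packing kernel variant at compile time based
// on where the default execution space's memory lives. Both variants are
// placeholder stubs here.
void PackCpuVariant() { /* vector-friendly innermost i loop */ }
void PackGpuVariant() { /* flat index space, finer-grained parallelism */ }

void Pack() {
  using mem_t = Kokkos::DefaultExecutionSpace::memory_space;
  if constexpr (std::is_same_v<mem_t, Kokkos::HostSpace>) {
    PackCpuVariant();  // host backends (Serial, OpenMP, ...)
  } else {
    PackGpuVariant();  // device backends (CUDA, HIP, ...)
  }
}
```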

@bprather (Collaborator):

So, testing on Chicoma CPU, using 64 1-thread processes on 2D Orszag-Tang and 3D SANE torus problems with 32x32(x32)-zone meshblocks, I find KHARMA is ~4M ZCPS with the new buffer packing, whereas it's about 4.5M ZCPS with the old buffer packing -- everything else completely identical. This is with the Cray compilers wrapping AOCC, using Kokkos "aggressive vectorization" but otherwise not fiddling with compile flags myself.

Performance is fun!

@pdmullen (Collaborator) commented Sep 21, 2023

> So, testing on Chicoma CPU, using 64 1-thread processes on 2D Orszag-Tang and 3D SANE torus problems with 32x32(x32)-zone meshblocks, I find KHARMA is ~4M ZCPS with the new buffer packing, whereas it's about 4.5M ZCPS with the old buffer packing -- everything else completely identical. This is with the Cray compilers wrapping AOCC, using Kokkos "aggressive vectorization" but otherwise not fiddling with compile flags myself.
>
> Performance is fun!

If relevant, the numbers @lroberts36 reported used Intel 20.2.2 with flags `-mavx2 -qopt-report=1 -qopt-report-phase=vec,loop -fno-math-errno -march=broadwell`.

@bprather (Collaborator):

I'll play with compilers a bit more -- certainly, neither of those numbers reflects what KHARMA can really pull on the hardware, so it's not a realistic enough test case to be damning. Still, I'd bet that for many non-vectorized cases this is going to drag down performance.

I'm surprised the ICC compile line was that simple -- from the KNL days, I know there are some spicy options for classic ICC that can convince it to (attempt to) vectorize just about anything, so maybe there's a way to vectorize the existing code for Riot's case, but still keep good performance for noobs using LLVM-based compilers.

From iharm3d's code it looks like I added at least `-funroll-loops -ipo -qopt-prefetch=5`, and also used either the `-x` option or `-march=native -mtune=native` together. I think prefetch level 5 is KNL-only, but the option in general might be worth looking at.

@bprather (Collaborator) commented Sep 25, 2023

When compiling with Intel classic, KHARMA sees about a 10% improvement with the new buffers on a Skylake system (2.2M to 2.45M). Interestingly it also improves under IntelLLVM (2.25M to 2.35M), and dramatically with AOCC (2.0M to 2.4M) on the same node. This is using one process split with OpenMP, but I tested and the trend holds with 1 proc/core at least with AOCC.

I'll look back at what might have happened on Chicoma, but I think it's safe to call it an outlier FWIW. I can also provide GPU numbers from e.g. Frontier if this would be useful. Nvidia machines are hard due to the number of clean builds required and #922.

EDIT: haha, did I say AOCC improved? I meant that it now crashes when compiling the SetBuffer code. It recommends a bug report to AMD with the location/backtrace below:

```
1.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:256:35: current parser token ';'
2.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:42:1: parsing namespace 'parthenon'
3.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:225:59: parsing function body 'parthenon::SetBounds'
4.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:225:59: in compound statement ('{}')
5.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:241:7 <Spelling=/vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/external/Kokkos/core/src/Kokkos_Macros.hpp:119:23>: lambda expression parsing
6.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:241:56: in compound statement ('{}')
7.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:244:75: in compound statement ('{}')
8.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:247:67: in compound statement ('{}')
9.  /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:250:17: lambda expression parsing
11. /vast/home/bprather/Code/cpu-perf-tests/kharma-newbufs/external/parthenon/src/bvals/comms/boundary_communication.cpp:250:36: in compound statement ('{}')
```

I didn't notice and had recorded the 2.4M number from an Intel LLVM binary, so also: take these measurements with a +/- 0.05M ZCPS grain of salt.

EDIT EDIT: this seems to be fixed in AOCC 3.2, since the code compiles fine on Chicoma. It's just that I can't provide any AOCC numbers except over there.

@bprather (Collaborator) commented Sep 25, 2023

Re-testing the most recent KHARMA code on Chicoma carefully, there's a slight speedup (5%) with the new buffers on CPUs -- this is still with AOCC 3.2, but a newer version of KHARMA and under more realistic problems/conditions. Overall, this brings KHARMA into line with a milder version of what Riot is seeing: uniformly higher CPU performance, though not by amounts as large as Riot's.

However, when running on Nvidia GPUs with one block per rank (Milan/A100), we actually seem to get better performance with the new kernels, on the order of 15%. So, pulling this one. (This is with NVCC/NVC++, other Chicoma defaults).

EDIT: changed conclusions due to my shoddy testing practices. To be fair, this was the first time I got KHARMA working with the new device-side buffers, so I had no idea what to expect.

@pdmullen (Collaborator):

So I see two data points that favor the new buffer packing design... anybody willing to test how AthenaPK fares?

@pgrete (Collaborator) commented Sep 26, 2023

Here are the AthenaPK numbers (I just tested uniform grids on a single node with different block sizes to focus on buffer packing).

Intel (2× Intel Xeon Platinum 8168 CPU, 2× 24 cores, 2.7 GHz, on JUWELS)

  • Compiler: Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)
  • Uniform mesh 256x256x192 and block sizes of 64^3, 32^3, and 16^3

With current develop (in 10^8 zcs)

| Block size | 16 | 32 | 64 |
| --- | --- | --- | --- |
| pack_size = -1 | 0.117 | 0.196 | 0.276 |
| pack_size = 1 | 0.19 | 0.236 | 0.277 |

With this branch:

| Block size | 16 | 32 | 64 |
| --- | --- | --- | --- |
| pack_size = -1 | 0.158 | 0.248 | 0.32 |
| pack_size = 1 | 0.259 | 0.291 | 0.324 |

Relative improvement with this branch:

| Block size | 16 | 32 | 64 |
| --- | --- | --- | --- |
| pack_size = -1 | 1.35 | 1.27 | 1.16 |
| pack_size = 1 | 1.36 | 1.23 | 1.17 |

PS: An interesting observation I made is that the new Intel compiler actually produces faster code than the legacy one.
Last time I checked this was not the case.

GPU (4x Nvidia A100 on JUWELS Booster)

  • Compiler: cuda_11.7.r11.7/compiler.31294372_0
  • Uniform mesh 512x256x256 and block sizes of 32^3, 64^3 and 128^3

With current develop (in 10^8 zcs)

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 2.57 | 5.99 | 6 |
| pack_size = 1 | 0.584 | 1.61 | 3.34 |

With this branch:

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 2.48 | 5.91 | 6.39 |
| pack_size = 1 | 0.628 | 1.82 | 3.8 |

Relative improvement with this branch:

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 0.96 | 0.99 | 1.06 |
| pack_size = 1 | 1.08 | 1.13 | 1.14 |

Bottom line

AthenaPK also sees up to a ~30% aggregate performance improvement at small meshblocks with Intel on CPU; on A100, it is only for very small meshblocks that the change does not yield an improvement.
I'm still surprised by this and am not sure yet what the bigger takeaway here is (the smaller one is that the change is also a win for us and we should go forward with it).

@bprather (Collaborator):

> 1.14

This is what I was seeing on A100s for big blocks too (128x64x64, I think). This is a non-trivial boost, at least for us. Where's the speedup coming from, do we think? Do we have any explanation? Should I start pulling pointers in KHARMA so I can perform better on GPUs?

@pgrete (Collaborator) commented Sep 27, 2023

> > 1.14
>
> This is what I was seeing on A100s for big blocks too (128x64x64, I think). This is a non-trivial boost, at least for us. Where's the speedup coming from, do we think? Do we have any explanation? Should I start pulling pointers in KHARMA so I can perform better on GPUs?

The answer is probably more complex.
I did some more testing (also against older versions, e.g., an AthenaPK one from about 14 months ago) on the 4x A100 machine.

The following are kernel/region timers (using the space-time profiling tool) comparing the old version (a-m) versus current Parthenon develop (dev) versus this branch (buf) for 128^3 blocks
[image: kernel/region timers, 128^3 blocks]
and 64^3 blocks
[image: kernel/region timers, 64^3 blocks]
and 32^3 blocks
[image: kernel/region timers, 32^3 blocks]

So at 128^3 blocks current dev is slower for the SetBoundaries kernel than both old and this branch (with old being the fastest), whereas for 32^3 to 64^3 it seems to be the other way round (current dev is the fastest).
I'm not sure what to make of those numbers yet, and I also noticed quite a bit of per-cycle variability on the node I used for testing.

One takeaway I already have is that I should bump up performance regression testing on my todo list...

@lroberts36 (Collaborator, Author):

@pgrete: What is the hash of the old version of Parthenon you were looking at? Did it have sparse-related things turned on?

@pdmullen (Collaborator):

@pgrete, we'd like to get this in for our downstream codes --- it seems that everybody benefits.

Perhaps we don't completely understand why yet, but should that be a prerequisite for getting this merged?

@pdmullen (Collaborator) left a comment

LGTM

@pgrete (Collaborator) commented Sep 29, 2023

> @pgrete: What is the hash of the old version of Parthenon you were looking at? Did it have sparse-related things turned on?

This was with Parthenon version 5b6bb61906f7c278f9724ee9f38e79dee8707098

Sparse is still disabled at compile time (PARTHENON_DISABLE_SPARSE) in AthenaPK by default.

@pgrete (Collaborator) commented Sep 29, 2023

> @pgrete, we'd like to get this in for our downstream codes --- it seems that everybody benefits.
>
> Perhaps we don't completely understand why yet, but should that be a prerequisite for getting this merged?

Sorry if I gave that impression.
Given that this seems to be a broad win, I'm definitely in favor of including the fix (even if, as you say, we don't fully understand it yet).


```cpp
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
                     [&](int m) { buf[m] = var[m]; });
```

Collaborator:

Do we need a team barrier here (as we use buf[m] in the following par_reduce)?

@lroberts36 (Collaborator, Author):

I thought that we didn't because each element of buf was only being accessed by the same thread that set it. Maybe I am misunderstanding the Kokkos model though. If we have to add a team barrier I think we should just combine the parallel_reduce with the parallel_for.
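
(For concreteness, if a barrier did turn out to be necessary, the sketch below shows where it would go; buf, var, and Ni are as in the snippet above, and the barrier line is hypothetical.)

```cpp
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
                     [&](int m) { buf[m] = var[m]; });
// Hypothetical: only needed if other threads of the team read buf[m]
// before the next nested policy.
team_member.team_barrier();
// ... the following parallel_reduce over buf would go here ...
```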

```cpp
Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);

Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
```
Collaborator:

Out of curiosity, have you tried a TeamVectorRange and/or the new TeamMDRange?

@lroberts36 (Collaborator, Author):

I tried TeamVectorRange (as a single level of parallelism), but that seemed to be slower at least on CPU. I have not experimented with TeamMDRange, I think it would be interesting to try this out.

As you pointed out, it is not clear what the origin of the performance boost from these changes is. My original thought was that it would promote vectorization on CPU along the i-direction, which would likely be beneficial when the i-direction extent is not just nghost. That being said, it is also possible that some of the benefit comes from the fact that we perform the indexing calculations significantly fewer times with the current setup.

@jdolence (Collaborator) commented Oct 2, 2023

Is there any documentation for TeamMDRange? Can't seem to find any. What does it do? @lroberts36 are you going to try this, or should we just merge what we have and try this out later?

@lroberts36 (Collaborator, Author):

@jdolence: I believe TeamMDRange should give us a Kokkos native way to have an inner loop go over multiple indices (rather than having the loop go over a single index and then calculate the multi-dimensional indices by hand as we do now). I don't necessarily see why TeamMDRange would give a performance boost, but that doesn't mean we shouldn't test it at some point. I would leave that for a future PR though.
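
(For reference, a hedged sketch of what the inner loops might look like with the TeamThreadMDRange policy that newer Kokkos releases provide; names and sizes are illustrative, not Parthenon code.)

```cpp
#include <Kokkos_Core.hpp>

// Sketch: TeamThreadMDRange walks multiple indices natively, instead of
// flattening to a single index and decomposing it by hand.
void pack_md_demo() {
  const int nbuf = 4, Nk = 8, Nj = 8, Ni = 32;
  Kokkos::View<double ****, Kokkos::LayoutRight> var("var", nbuf, Nk, Nj, Ni);
  Kokkos::View<double **, Kokkos::LayoutRight> buf("buf", nbuf, Nk * Nj * Ni);

  using team_mbr_t = Kokkos::TeamPolicy<>::member_type;
  Kokkos::parallel_for(
      "pack_md", Kokkos::TeamPolicy<>(nbuf, Kokkos::AUTO),
      KOKKOS_LAMBDA(team_mbr_t team_member) {
        const int b = team_member.league_rank();
        Kokkos::parallel_for(
            Kokkos::TeamThreadMDRange<Kokkos::Rank<2>, team_mbr_t>(
                team_member, Nk, Nj),
            [&](const int k, const int j) {
              // Vector loop over the contiguous i index, as in this PR.
              Kokkos::parallel_for(
                  Kokkos::ThreadVectorRange(team_member, Ni),
                  [&](const int i) {
                    buf(b, (k * Nj + j) * Ni + i) = var(b, k, j, i);
                  });
            });
      });
}
```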

Comment on lines +119 to +121
```cpp
Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));

lnon_zero = lnon_zero || mnon_zero;
```
Collaborator:

Do we need a barrier here? I haven't looked into the blocking/nonblocking nature of inner par reduces to device memory (though they'd always be to device memory), so it might be a semantic question.

@lroberts36 (Collaborator, Author):

I thought we didn't because I thought the inner parallel_reduce served as a barrier itself (i.e. I thought mnon_zero was set on return). But honestly I couldn't find much documentation about how this was supposed to behave.

Comment on lines +114 to +115
```cpp
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange<>(team_member, Ni),
```
Collaborator:

Why is this reduce split from the above for? They go over the same range and only use local information, so it could all be in one reduce, couldn't it?

@lroberts36 (Collaborator, Author):

It could all be one reduce. I think I had split it up to see what the impact of removing the reduction was (since it is unnecessary for dense variables). This may be related to why you see slightly worse performance with newer versions of Parthenon compared to versions before my changes to the sparse-related infrastructure. There were a number of things in the boundary packing related to sparse that used to be compiled out via macros but are not now (we could go back to compiling them out if that is the source of your performance difference).
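
(A sketch of the fused version, for illustration: the reducer mirrors the diff above, while `threshold` is an assumed stand-in for the actual sparse-allocation cutoff.)

```cpp
// Hypothetical fusion of the copy and the sparse non-zero check into a
// single inner reduction; `threshold` is an assumed name, not the actual
// Parthenon variable.
bool mnon_zero = false;
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange<>(team_member, Ni),
    [&](const int m, bool &lnz) {
      buf[m] = var[m];
      lnz = lnz || (std::abs(buf[m]) >= threshold);
    },
    Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));
lnon_zero = lnon_zero || mnon_zero;
```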

```cpp
bnd_info(b).buf(idx + idx_offset);
});
Kokkos::parallel_for(
    Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
```
Collaborator:

As above have you tried a TeamVectorRange and/or the new TeamMDRange?

@lroberts36 (Collaborator, Author):

As above, I tried TeamVectorRange but not TeamMDRange.

@pgrete (Collaborator) commented Oct 6, 2023

I just checked the "split" version (branch roberts36/speedup-buffer-kernel-split) that @lroberts36 suggested yesterday during the call.
The relative difference to the changes in this PR is as follows (on 4x A100):

| Block size | 32 | 64 | 128 |
| --- | --- | --- | --- |
| pack_size = -1 | 1.00 | 1.00 | 1.01 |
| pack_size = 1 | 1.02 | 1.02 | 1.02 |

Everything is within the noise, so it doesn't make a difference.

@jdolence (Collaborator) commented Oct 6, 2023

@pgrete with your approval I assume we're good to merge, right?

@jdolence (Collaborator) commented Oct 6, 2023

Based on approvals, going ahead with enabling auto-merge.

@jdolence jdolence enabled auto-merge October 6, 2023 15:29
@jdolence jdolence merged commit 8b49d77 into develop Oct 6, 2023