Restructure buffer packing kernels #938
Conversation
I'd like to hear other people's thoughts on the trade-off between CPU and GPU performance, especially from @pgrete, but to my mind this is a win.
@Yurlungur: Agreed about hearing people's opinions; that was part of the reason for putting up the PR. I know @pgrete looked at how the parallelism was structured in these kernels in the past, and I think the way it was written previously gave an approximately optimal balance between performance on different architectures. I am not sure that still holds, though, after the number of changes made in the interim.
I'm happy to revisit the kernel structure wrt performance.
Integrated performance.
Wow 😲
Yeah, I am a little worried that we will find that we just need to write two separate kernels and switch between them based on compile-time options.
So, testing on Chicoma CPU, using 64 single-thread processes on 2D Orszag-Tang and 3D SANE torus problems with 32x32(x32)-zone meshblocks, I find KHARMA hits ~4M ZCPS with the new buffer packing, whereas it's about 4.5M ZCPS with the old buffer packing -- everything else completely identical. This is with the Cray compilers wrapping AOCC, using Kokkos "aggressive vectorization" but otherwise not fiddling with compile flags myself. Performance is fun!
If relevant, the numbers @lroberts36 reported used
I'll play with compilers a bit more -- certainly, neither of those numbers reflects what KHARMA can really pull on the hardware, so it's not a realistic enough test case to be damning. Just, I'd bet that for many non-vectorized cases this is going to drag down performance. I'm surprised the ICC compile line was that simple -- from the KNL days, I know there are some spicy options for classic ICC that can convince it to (attempt to) vectorize just about anything, so maybe there's a way to vectorize the existing code for Riot's case, but still keep good performance for noobs using LLVM-based compilers. From iharm3d's code it looks like I added at least
When compiling with Intel classic, KHARMA sees about a 10% improvement with the new buffers on a Skylake system (2.2M to 2.45M). Interestingly, it also improves under IntelLLVM (2.25M to 2.35M). I'll look back at what might have happened on Chicoma, but I think it's safe to call it an outlier FWIW. I can also provide GPU numbers from e.g. Frontier if that would be useful; Nvidia machines are hard due to the # of clean builds required and #922. EDIT: haha, did I say AOCC improved? I meant now it crashes when compiling the
I didn't notice and recorded the 2.4M number from an Intel LLVM binary, so also: take these measurements with a +/- 0.05M ZCPS grain of salt. EDIT EDIT: this seems to be fixed in AOCC 3.2, since the code compiles fine on Chicoma. Just, I can't provide any AOCC numbers except over there.
Re-testing the most recent KHARMA code on Chicoma carefully, there's a slight speedup (5%) with the new buffers on CPUs -- this is still with AOCC 3.2, but a newer version of KHARMA under more realistic problems/conditions. Overall, this brings KHARMA into line with a milder version of what Riot is seeing: uniformly higher CPU performance, though not by amounts as relevant as Riot's. However, when running on Nvidia GPUs with one block per rank (Milan/A100), we actually seem to get better performance with the new kernels, on the order of 15%. So, pulling for this one. (This is with NVCC/NVC++, other Chicoma defaults.) EDIT: changed conclusions due to my shoddy testing practices. To be fair, this was the first time I got KHARMA working with the new device-side buffers, so I had no idea what to expect.
So I see two data points that favor the new buffer packing design... anybody willing to test how AthenaPK fares?
Here are the AthenaPK numbers (I just tested uniform grids on a single node with different block sizes to focus on buffer packing).

Intel (2× Intel Xeon Platinum 8168 CPU, 2× 24 cores, 2.7 GHz on JUWELS)
With current
With this branch:
Relative improvement with this branch:
PS: An interesting observation I made is that the new Intel compiler actually produces faster code than the legacy one.

GPU (4× Nvidia A100 on JUWELS Booster)
With current
With this branch:
Relative improvement with this branch:
Bottom line: AthenaPK also experiences a ~30% aggregate performance improvement at small meshblocks for Intel on CPU; on A100, only for very small meshblocks does it not result in an improvement.
This is what I was seeing on A100s for big blocks too (128x64x64, I think). This is a non-trivial boost, at least for us. What's behind the speedup here, do we think? Do we have any explanation? Should I start pulling pointers in KHARMA so I can perform better on GPUs?
@pgrete: What is the hash of the old version of Parthenon you were looking at? Did it have sparse-related things turned on?
@pgrete, we'd like to get this in for our downstream codes --- it seems that everybody benefits. Perhaps we don't completely understand why yet, but should that be a prerequisite for getting this merged?
LGTM
This was with Parthenon version 5b6bb61906f7c278f9724ee9f38e79dee8707098. Sparse is still compile-time disabled.
Sorry if I gave that impression.
```cpp
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
                     [&](int m) { buf[m] = var[m]; });
```
Do we need a team barrier here (as we use `buf[m]` in the following `par_reduce`)?
I thought that we didn't, because each element of `buf` was only being accessed by the same thread that set it. Maybe I am misunderstanding the Kokkos model, though. If we have to add a team barrier, I think we should just combine the `parallel_reduce` with the `parallel_for`.
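For concreteness, a minimal sketch of that fused version, assuming the surrounding team kernel and the `buf`/`var` pointers and `Ni` extent from the hunk above; the `mnon_zero` flag mirrors the reduction quoted later in this thread. This is an illustration, not the PR's actual code:

```cpp
// Fuse the pack loop and the sparse occupancy check into one inner
// parallel_reduce, so no barrier between a separate for and reduce is needed.
bool mnon_zero = false;
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange(team_member, Ni),
    [&](const int m, bool &lnz) {
      buf[m] = var[m];               // pack into the communication buffer
      lnz = lnz || (var[m] != 0.0);  // track whether anything is non-zero
    },
    Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));
```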
```cpp
Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);

Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
```
Out of curiosity, have you tried a `TeamVectorRange` and/or the new `TeamMDRange`?
I tried `TeamVectorRange` (as a single level of parallelism), but that seemed to be slower, at least on CPU. I have not experimented with `TeamMDRange`; I think it would be interesting to try it out.

As you pointed out, it is not clear what the origin of the performance boost from these changes is. My original thought was that it would promote vectorization on CPU along the `i`-direction, which would likely be beneficial when the `i`-extent being packed is larger than `nghost`. That being said, it is also possible that there is a little benefit from the fact that we perform the indexing calculations significantly fewer times with the current setup.
Is there any documentation for `TeamMDRange`? Can't seem to find any. What does it do? @lroberts36, are you going to try this, or should we just merge what we have and try this out later?
@jdolence: I believe `TeamMDRange` should give us a Kokkos-native way to have an inner loop go over multiple indices (rather than having the loop go over a single index and then calculate the multi-dimensional indices by hand, as we do now). I don't necessarily see why `TeamMDRange` would give a performance boost, but that doesn't mean we shouldn't test it at some point. I would leave that for a future PR, though.
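For reference, a rough sketch of the nested MD policy as described in the Kokkos documentation; the extents `Nj`/`Ni` and the `scratch` view are placeholders of mine, not the PR's actual loop bounds:

```cpp
// The team strides over a 2D index space directly, instead of flattening to a
// single index and recovering (j, i) by division/modulo by hand.
using team_mbr_t = Kokkos::TeamPolicy<>::member_type;
Kokkos::parallel_for(
    Kokkos::TeamThreadMDRange<Kokkos::Rank<2>, team_mbr_t>(team_member, Nj, Ni),
    [&](const int j, const int i) {
      scratch(j, i) = 2.0 * scratch(j, i);  // some per-element work
    });
```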
```cpp
    Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));

lnon_zero = lnon_zero || mnon_zero;
```
Do we need a barrier here? I haven't looked into the blocking/nonblocking nature of inner par reduces to device memory (though they'd always be to device memory), so it might be a semantic question.
I thought we didn't, because I thought the inner `parallel_reduce` served as a barrier itself (i.e. I thought `mnon_zero` was set on return). But honestly, I couldn't find much documentation about how this is supposed to behave.
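If the nested-reduce semantics do turn out to be ambiguous, one defensive option (my suggestion, not something settled in this thread) is an explicit team barrier before the result is consumed:

```cpp
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange(team_member, Ni),
    [&](const int m, bool &lnz) { lnz = lnz || (buf[m] != 0.0); },
    Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));
// Synchronize the whole team so every thread sees the reduced value before
// anyone reads mnon_zero; cheap insurance if the reduce is not itself blocking.
team_member.team_barrier();
```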
```cpp
Kokkos::parallel_reduce(
    Kokkos::ThreadVectorRange<>(team_member, Ni),
```
Why is this reduce split from the `for` above? They go over the same range and only use local information, so couldn't it all be in one reduce?
It could all be one reduce. I think I had split it up to see what the impact of removing the reduction was (since it is unnecessary for dense variables). This may be related to why you see slightly worse performance with newer versions of Parthenon compared to versions before my changes to sparse-related infrastructure: there were a number of things in the boundary packing related to sparse that used to be turned off via compile-time macros but no longer are (we could go back to including that if it is the source of your performance difference).
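A sketch of what gating the sparse reduction back behind a compile-time switch might look like; the `PARTHENON_ENABLE_SPARSE` macro name is hypothetical, while `buf`, `var`, `Ni`, and `mnon_zero` follow the hunks quoted in this thread:

```cpp
#ifdef PARTHENON_ENABLE_SPARSE
  // Sparse build: pack and accumulate the occupancy flag in one pass.
  Kokkos::parallel_reduce(
      Kokkos::ThreadVectorRange(team_member, Ni),
      [&](const int m, bool &lnz) {
        buf[m] = var[m];
        lnz = lnz || (var[m] != 0.0);
      },
      Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));
#else
  // Dense build: the reduction is unnecessary, so just copy.
  Kokkos::parallel_for(Kokkos::ThreadVectorRange(team_member, Ni),
                       [&](const int m) { buf[m] = var[m]; });
#endif
```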
```cpp
      bnd_info(b).buf(idx + idx_offset);
    });
Kokkos::parallel_for(
    Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
```
As above, have you tried a `TeamVectorRange` and/or the new `TeamMDRange`?
As above, I tried `TeamVectorRange` but not `TeamMDRange`.
I just checked the "split" version (branch
Everything is within the noise, so it doesn't make a difference.
@pgrete, with your approval I assume we're good to merge, right?
Based on approvals, going ahead with enabling auto-merge.
PR Summary
This PR adds a third level of hierarchical parallelism to the buffer packing/unpacking kernels in `SendBoundsBufs` and `SetBounds`. The new innermost level is a `ThreadVectorRange` that goes over the `i` index we are packing over, which is contiguous in memory. In the new innermost loop, we access the variable and buffer via raw pointers, which presumably enables vectorization, reduces the number of required index calculations, and (maybe?) helps with caching.

This change gives a 30% speedup in 128^3 simulations in Riot with 32^3 blocks on CPU. There is a slight degradation in performance on GPUs, though (5-6% for 32^3 and 64^3 blocks, based on @pdmullen's timings).
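To make the described structure concrete, here is a schematic of the three levels (not the PR's literal code: `nwork`, `npencils`, `idx_offset`, and the `PencilIndices` helper are placeholder names of mine, while `bnd_info` and `Ni` follow the hunks quoted above):

```cpp
// Level 1: one team per (buffer, block) work item.
Kokkos::parallel_for(
    Kokkos::TeamPolicy<>(nwork, Kokkos::AUTO),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type &team_member) {
      const int b = team_member.league_rank();
      // Level 2: threads of the team split the pencils of length Ni.
      Kokkos::parallel_for(
          Kokkos::TeamThreadRange(team_member, npencils),
          [&](const int idx) {
            // Compute the multi-dimensional indices for this pencil once
            // (hypothetical helper), then grab raw pointers to its start.
            const auto [iel, t, u, v, k, j, i] = PencilIndices(idx);
            Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
            Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);
            // Level 3: vector lanes stream the contiguous i index, which is
            // what presumably lets the compiler/hardware vectorize the copy.
            Kokkos::parallel_for(Kokkos::ThreadVectorRange(team_member, Ni),
                                 [&](const int m) { buf[m] = var[m]; });
          });
    });
```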
PR Checklist