Restructure buffer packing kernels #938
Changes from 6 commits: ecad3b4, 6e7467e, 3d41023, 82d60c8, 7f1d629, 3c71701, 5be9d17
```diff
@@ -82,6 +82,7 @@ TaskStatus SendBoundBufs(std::shared_ptr<MeshData<Real>> &md) {
   PARTHENON_DEBUG_REQUIRE(bnd_info.size() == nbound, "Need same size for boundary info");
   auto &sending_nonzero_flags = cache.sending_non_zero_flags;
   auto &sending_nonzero_flags_h = cache.sending_non_zero_flags_h;
+
   Kokkos::parallel_for(
       "SendBoundBufs",
       Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), nbound, Kokkos::AUTO),
```
```diff
@@ -98,13 +99,26 @@ TaskStatus SendBoundBufs(std::shared_ptr<MeshData<Real>> &md) {
       int idx_offset = 0;
       for (int iel = 0; iel < bnd_info(b).ntopological_elements; ++iel) {
         auto &idxer = bnd_info(b).idxer[iel];
+        const int Ni = idxer.template EndIdx<5>() - idxer.template StartIdx<5>() + 1;
         Kokkos::parallel_reduce(
-            Kokkos::TeamThreadRange<>(team_member, idxer.size()),
+            Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
             [&](const int idx, bool &lnon_zero) {
-              const auto [t, u, v, k, j, i] = idxer(idx);
-              const Real &val = bnd_info(b).var(iel, t, u, v, k, j, i);
-              bnd_info(b).buf(idx + idx_offset) = val;
-              lnon_zero = lnon_zero || (std::abs(val) >= threshold);
+              const auto [t, u, v, k, j, i] = idxer(idx * Ni);
+              Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
+              Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);
+
+              Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
+                                   [&](int m) { buf[m] = var[m]; });
+
+              bool mnon_zero = false;
+              Kokkos::parallel_reduce(
+                  Kokkos::ThreadVectorRange<>(team_member, Ni),
+                  [&](int m, bool &llnon_zero) {
+                    llnon_zero = llnon_zero || (std::abs(buf[m]) >= threshold);
+                  },
+                  Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));
+
+              lnon_zero = lnon_zero || mnon_zero;
             },
             Kokkos::LOr<bool, parthenon::DevMemSpace>(non_zero[iel]));
         idx_offset += idxer.size();
```

**Reviewer** (on the inner vector copy): Do we need a team barrier here (as we use […]

**Author:** I thought that we didn't because each element of […]

**Reviewer** (on lines +114 to +115): Why is this reduce split from the `parallel_for` above? They go over the same range and only use local information, so it could all be in one reduce, couldn't it?

**Author:** It could all be one reduce. I think I had split it up to see what the impact of removing the reduction was (since it is unnecessary for dense variables). This may be related to why you see slightly worse performance with newer versions of Parthenon compared to versions before my changes to the sparse-related infrastructure. There were a number of things in the boundary packing related to sparse variables that used to be turned off via macros at compile time and are not now (but we could go back to including that if that is the source of your performance difference).

**Reviewer** (on lines +119 to +121): Do we need a barrier here? I haven't looked into the blocking/nonblocking nature of inner par reduces to device memory (though they'd always be to device memory), so it might be a semantic question.

**Author:** I thought we didn't because I thought the inner […]
```diff
@@ -229,21 +243,41 @@ TaskStatus SetBounds(std::shared_ptr<MeshData<Real>> &md) {
       int idx_offset = 0;
       for (int iel = 0; iel < bnd_info(b).ntopological_elements; ++iel) {
         auto &idxer = bnd_info(b).idxer[iel];
+        const int Ni = idxer.template EndIdx<5>() - idxer.template StartIdx<5>() + 1;
         if (bnd_info(b).buf_allocated && bnd_info(b).allocated) {
-          Kokkos::parallel_for(Kokkos::TeamThreadRange<>(team_member, idxer.size()),
-                               [&](const int idx) {
-                                 const auto [t, u, v, k, j, i] = idxer(idx);
-                                 if (idxer.IsActive(k, j, i))
-                                   bnd_info(b).var(iel, t, u, v, k, j, i) =
-                                       bnd_info(b).buf(idx + idx_offset);
-                               });
+          Kokkos::parallel_for(
+              Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
+              [&](const int idx) {
+                const auto [t, u, v, k, j, i] = idxer(idx * Ni);
+                Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
+                Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);
+                // Have to do this because of some weird issue about structured
+                // bindings being captured
+                const int kk = k;
+                const int jj = j;
+                const int ii = i;
+                Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
+                                     [&](int m) {
+                                       if (idxer.IsActive(kk, jj, ii + m))
+                                         var[m] = buf[m];
+                                     });
+              });
         } else if (bnd_info(b).allocated) {
           const Real default_val = bnd_info(b).var.sparse_default_val;
-          Kokkos::parallel_for(Kokkos::TeamThreadRange<>(team_member, idxer.size()),
-                               [&](const int idx) {
-                                 const auto [t, u, v, k, j, i] = idxer(idx);
-                                 bnd_info(b).var(iel, t, u, v, k, j, i) = default_val;
-                               });
+          Kokkos::parallel_for(
+              Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
+              [&](const int idx) {
+                const auto [t, u, v, k, j, i] = idxer(idx * Ni);
+                Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
+                const int kk = k;
+                const int jj = j;
+                const int ii = i;
+                Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
+                                     [&](int m) {
+                                       if (idxer.IsActive(kk, jj, ii + m))
+                                         var[m] = default_val;
+                                     });
+              });
         }
         idx_offset += idxer.size();
       }
```

**Reviewer:** As above, have you tried a […]

**Author:** As above, I tried […]
**Reviewer:** Out of curiosity, have you tried a `TeamVectorRange` and/or the new `TeamMDRange`?

**Author:** I tried `TeamVectorRange` (as a single level of parallelism), but that seemed to be slower, at least on CPU. I have not experimented with `TeamMDRange`; I think it would be interesting to try it out. As you pointed out, it is not clear what the origin of the performance boost from these changes is. My original thought was that it would promote vectorization on CPU along the `i`-direction, which would likely be beneficial when the `i`-direction extent is not `nghost`. That being said, it is also possible that there is a little benefit from the fact that we perform the indexing calculations significantly fewer times with the current setup.

**Reviewer (@jdolence):** Is there any documentation for `TeamMDRange`? I can't seem to find any. What does it do? @lroberts36, are you going to try this, or should we just merge what we have and try it out later?

**Author:** @jdolence: I believe `TeamMDRange` should give us a Kokkos-native way to have an inner loop go over multiple indices (rather than having the loop go over a single index and then calculating the multi-dimensional indices by hand, as we do now). I don't necessarily see why `TeamMDRange` would give a performance boost, but that doesn't mean we shouldn't test it at some point. I would leave that for a future PR, though.