Restructure buffer packing kernels #938

Merged 7 commits on Oct 6, 2023 (changes shown from 6 commits)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -31,6 +31,7 @@
- [[PR 890]](https://github.com/parthenon-hpc-lab/parthenon/pull/890) Fix bugs in sparse communication and prolongation

### Infrastructure (changes irrelevant to downstream codes)
- [[PR 938]](https://github.com/parthenon-hpc-lab/parthenon/pull/938) Restructure buffer packing/unpacking kernel hierarchical parallelism
- [[PR 904]](https://github.com/parthenon-hpc-lab/parthenon/pull/904) Move to prolongation/restriction in one for AMR and communicate non-cell centered fields
- [[PR 918]](https://github.com/parthenon-hpc-lab/parthenon/pull/918) Refactor RegionSize
- [[PR 901]](https://github.com/parthenon-hpc-lab/parthenon/pull/901) Implement shared element ownership model
68 changes: 51 additions & 17 deletions src/bvals/comms/boundary_communication.cpp
@@ -82,6 +82,7 @@ TaskStatus SendBoundBufs(std::shared_ptr<MeshData<Real>> &md) {
PARTHENON_DEBUG_REQUIRE(bnd_info.size() == nbound, "Need same size for boundary info");
auto &sending_nonzero_flags = cache.sending_non_zero_flags;
auto &sending_nonzero_flags_h = cache.sending_non_zero_flags_h;

Kokkos::parallel_for(
"SendBoundBufs",
Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), nbound, Kokkos::AUTO),
@@ -98,13 +99,26 @@ TaskStatus SendBoundBufs(std::shared_ptr<MeshData<Real>> &md) {
int idx_offset = 0;
for (int iel = 0; iel < bnd_info(b).ntopological_elements; ++iel) {
auto &idxer = bnd_info(b).idxer[iel];
const int Ni = idxer.template EndIdx<5>() - idxer.template StartIdx<5>() + 1;
Kokkos::parallel_reduce(
Kokkos::TeamThreadRange<>(team_member, idxer.size()),
Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
[&](const int idx, bool &lnon_zero) {
const auto [t, u, v, k, j, i] = idxer(idx);
const Real &val = bnd_info(b).var(iel, t, u, v, k, j, i);
bnd_info(b).buf(idx + idx_offset) = val;
lnon_zero = lnon_zero || (std::abs(val) >= threshold);
const auto [t, u, v, k, j, i] = idxer(idx * Ni);
Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);

Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
Collaborator: Out of curiosity, have you tried a TeamVectorRange and/or the new TeamMDRange?

Collaborator (author): I tried TeamVectorRange (as a single level of parallelism), but that seemed to be slower, at least on CPU. I have not experimented with TeamMDRange; I think it would be interesting to try it out.

As you pointed out, it is not clear what the origin of the performance boost from these changes is. My original thought was that they would promote vectorization on CPU along the i-direction, which would likely be beneficial when the i-direction extent is not just nghost. That being said, it is also possible that some of the benefit comes from performing the indexing calculations significantly fewer times with the current setup.

Collaborator (@jdolence, Oct 2, 2023): Is there any documentation for TeamMDRange? I can't seem to find any. What does it do? @lroberts36, are you going to try this, or should we just merge what we have and try it out later?

Collaborator (author): @jdolence: I believe TeamMDRange should give us a Kokkos-native way to have an inner loop go over multiple indices (rather than having the loop go over a single index and then calculating the multi-dimensional indices by hand as we do now). I don't necessarily see why TeamMDRange would give a performance boost, but that doesn't mean we shouldn't test it at some point. I would leave that for a future PR, though.

(A minimal TeamThreadMDRange sketch appears at the end of this section.)

[&](int m) { buf[m] = var[m]; });

Collaborator: Do we need a team barrier here (as we use buf[m] in the following par_reduce)?

Collaborator (author): I thought that we didn't, because each element of buf is only accessed by the same thread that set it. Maybe I am misunderstanding the Kokkos model, though. If we have to add a team barrier, I think we should just combine the parallel_reduce with the parallel_for.

bool mnon_zero = false;
Kokkos::parallel_reduce(
Kokkos::ThreadVectorRange<>(team_member, Ni),
Comment on lines +114 to +115

Collaborator: Why is this reduce split from the for above? They go over the same range and only use local information, so it could all be done in one reduce, couldn't it?

Collaborator (author): It could all be one reduce. I think I split it up to see what the impact of removing the reduction was (since it is unnecessary for dense variables). This may be related to why you see slightly worse performance with newer versions of Parthenon compared to versions before my changes to the sparse-related infrastructure: a number of sparse-related things in the boundary packing used to be turned off via macros at compile time and no longer are (we could go back to including that if it turns out to be the source of your performance difference).

(A sketch of the fused copy-and-reduce variant follows this hunk.)

[&](int m, bool &llnon_zero) {
llnon_zero = llnon_zero || (std::abs(buf[m]) >= threshold);
},
Kokkos::LOr<bool, parthenon::DevMemSpace>(mnon_zero));

lnon_zero = lnon_zero || mnon_zero;
Comment on lines +119 to +121

Collaborator: Do we need a barrier here? I haven't looked into the blocking/non-blocking nature of inner par_reduces to device memory (though they'd always be to device memory), so it might be a semantic question.

Collaborator (author): I thought we didn't, because I thought the inner parallel_reduce served as a barrier itself (i.e., I thought mnon_zero was set on return). But honestly, I couldn't find much documentation about how this is supposed to behave.

},
Kokkos::LOr<bool, parthenon::DevMemSpace>(non_zero[iel]));
idx_offset += idxer.size();
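
For reference on the review threads above (why the copy is split from the non-zero reduction, and whether a barrier is needed between them): the two inner loops could be fused into a single nested parallel_reduce so that each value is packed and tested in one pass, which sidesteps the barrier question entirely. Below is a minimal, self-contained sketch of that pattern; it is not Parthenon code, and the view names, extents, and the use of a plain count in place of the LOr reducer are illustrative assumptions.

```cpp
#include <Kokkos_Core.hpp>
#include <cmath>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using Real = double;
    const int nbound = 4, nlines = 8, Ni = 16; // illustrative extents
    const Real threshold = 1.e-12;

    Kokkos::View<Real ***> var("var", nbound, nlines, Ni);
    Kokkos::View<Real ***> buf("buf", nbound, nlines, Ni);
    Kokkos::View<bool *> non_zero("non_zero", nbound);
    Kokkos::deep_copy(var, 1.0);

    using team_policy = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for(
        "PackAndFlag", team_policy(nbound, Kokkos::AUTO),
        KOKKOS_LAMBDA(const team_policy::member_type &team_member) {
          const int b = team_member.league_rank();
          // Fused copy + threshold test: each value is read once, packed,
          // and counted toward the non-zero flag in the same pass, so no
          // separate copy loop is re-read by a later reduction.
          int bcount = 0;
          Kokkos::parallel_reduce(
              Kokkos::TeamThreadRange(team_member, nlines),
              [&](const int line, int &lcount) {
                int mcount = 0;
                Kokkos::parallel_reduce(
                    Kokkos::ThreadVectorRange(team_member, Ni),
                    [&](const int m, int &llcount) {
                      const Real val = var(b, line, m);
                      buf(b, line, m) = val; // pack
                      if (std::abs(val) >= threshold) llcount += 1;
                    },
                    mcount);
                lcount += mcount;
              },
              bcount);
          Kokkos::single(Kokkos::PerTeam(team_member),
                         [&]() { non_zero(b) = (bcount > 0); });
        });
  }
  Kokkos::finalize();
  return 0;
}
```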
@@ -229,21 +243,41 @@ TaskStatus SetBounds(std::shared_ptr<MeshData<Real>> &md) {
int idx_offset = 0;
for (int iel = 0; iel < bnd_info(b).ntopological_elements; ++iel) {
auto &idxer = bnd_info(b).idxer[iel];
const int Ni = idxer.template EndIdx<5>() - idxer.template StartIdx<5>() + 1;
if (bnd_info(b).buf_allocated && bnd_info(b).allocated) {
Kokkos::parallel_for(Kokkos::TeamThreadRange<>(team_member, idxer.size()),
[&](const int idx) {
const auto [t, u, v, k, j, i] = idxer(idx);
if (idxer.IsActive(k, j, i))
bnd_info(b).var(iel, t, u, v, k, j, i) =
bnd_info(b).buf(idx + idx_offset);
});
Kokkos::parallel_for(
Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
Collaborator: As above, have you tried a TeamVectorRange and/or the new TeamMDRange?

Collaborator (author): As above, I tried TeamVectorRange but not TeamMDRange.

[&](const int idx) {
const auto [t, u, v, k, j, i] = idxer(idx * Ni);
Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
Real *buf = &bnd_info(b).buf(idx * Ni + idx_offset);
// Copy the structured bindings into plain locals: structured bindings
// cannot be captured by the nested lambda (a pre-C++20 restriction)
const int kk = k;
const int jj = j;
const int ii = i;
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
[&](int m) {
if (idxer.IsActive(kk, jj, ii + m))
var[m] = buf[m];
});
});
} else if (bnd_info(b).allocated) {
const Real default_val = bnd_info(b).var.sparse_default_val;
Kokkos::parallel_for(Kokkos::TeamThreadRange<>(team_member, idxer.size()),
[&](const int idx) {
const auto [t, u, v, k, j, i] = idxer(idx);
bnd_info(b).var(iel, t, u, v, k, j, i) = default_val;
});
Kokkos::parallel_for(
Kokkos::TeamThreadRange<>(team_member, idxer.size() / Ni),
[&](const int idx) {
const auto [t, u, v, k, j, i] = idxer(idx * Ni);
Real *var = &bnd_info(b).var(iel, t, u, v, k, j, i);
const int kk = k;
const int jj = j;
const int ii = i;
Kokkos::parallel_for(Kokkos::ThreadVectorRange<>(team_member, Ni),
[&](int m) {
if (idxer.IsActive(kk, jj, ii + m))
var[m] = default_val;
});
});
}
idx_offset += idxer.size();
}
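
A note on the "weird issue about structure bindings" worked around above: structured bindings cannot be named in a lambda capture before C++20 (and compiler and extended-lambda support lagged even after that), so copying them into ordinary locals before the inner lambda is the usual workaround. A minimal stand-alone illustration, not Parthenon code:

```cpp
#include <tuple>

int main() {
  const auto [j, i] = std::tuple<int, int>{2, 3};

  // auto bad  = [j]() { return j; }; // ill-formed before C++20: structured
  //                                  // bindings cannot be named in a capture
  // auto bad2 = [&]() { return j; }; // also rejected by many C++17 compilers

  const int jj = j; // copy into ordinary locals ...
  const int ii = i;
  auto ok = [jj, ii]() { return jj + ii; }; // ... which capture fine everywhere
  return (ok() == 5) ? 0 : 1;
}
```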
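
On the TeamMDRange question raised in both review threads: Kokkos 4.0 and later provide nested multidimensional ranges such as Kokkos::TeamThreadMDRange, which hand the inner loop several indices directly instead of requiring a flattened index to be decoded by hand. The sketch below shows the pattern on a simple per-team copy; it is illustrative only (view names and extents are assumptions, and it is not what this PR implements). Whether it would beat the hand-rolled TeamThreadRange/ThreadVectorRange split used here is an open benchmarking question, as noted in the discussion.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nteams = 4, Nj = 8, Ni = 16; // illustrative extents
    Kokkos::View<double ***> var("var", nteams, Nj, Ni);
    Kokkos::View<double ***> buf("buf", nteams, Nj, Ni);
    Kokkos::deep_copy(var, 1.0);

    using team_policy = Kokkos::TeamPolicy<>;
    using member_type = team_policy::member_type;
    Kokkos::parallel_for(
        "CopyWithTeamThreadMDRange", team_policy(nteams, Kokkos::AUTO),
        KOKKOS_LAMBDA(const member_type &team_member) {
          const int b = team_member.league_rank();
          // The nested MDRange hands the (j, i) pair to the lambda directly,
          // so no manual flattening/unflattening of a single index is needed.
          Kokkos::parallel_for(
              Kokkos::TeamThreadMDRange<Kokkos::Rank<2>, member_type>(
                  team_member, Nj, Ni),
              [&](const int j, const int i) { buf(b, j, i) = var(b, j, i); });
        });
  }
  Kokkos::finalize();
  return 0;
}
```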