Add distributed capabilities #1133
Conversation
Do all commits belong to this pull request?
@yhmtsai I would say only the recent ones that were not covered in the earlier PRs need to be reviewed, so only the commits after the last merge commit.
env({{"MV2_COMM_WORLD_LOCAL_RANK", ""},
     {"OMPI_COMM_WORLD_LOCAL_RANK", ""},
     {"MPI_LOCALRANKID", ""},
     {"SLURM_LOCALID", ""}})
Does changing the environment after MPI_Init have an effect during the program?
It might. In each test I set the env variable explicitly before I call the mapping function, so if the env var is changed in between, the test might fail. But I think that shouldn't be an issue for our containers.
Also, I reset the environment to what it was at the time SetUp is called, so it won't be reset to the changes made in between.
 * @param recv_type the MPI_Datatype for the recv buffer
 * @param comm the communicator
 */
void all_to_all_v(std::shared_ptr<const Executor> exec,
I think it should be like the version above, using SendType and RecvType, and also scale the offsets by the type.
Could you expand on in which way this should be like the templated one? This one is needed for the custom MPI_Datatype to handle multiple right-hand sides.
If the MPI_Datatype is 8 bytes, will send_buffer+1 or send_buffer+2 (or the send_offsets 1 or 2) access send[1]?
I think that is an MPI detail that we don't need to care about.
I think we do need to, because this function takes the offsets here.
We have two versions for that:
- with the templated type -> it adds the MPI type and then passes it on to the following option
- with the void type
From my view of this interface, I would pass the offsets like (1, 2, 3) for the templated version no matter what the type is, because it takes care of the type. However, I would consider passing the offsets like (2, 4, 6) when the type size is 8 bytes for the void version, because I do not know whether the function takes care of the type size for the offsets.
After reading some documentation and the implementation, it seems to take care of the type size. MPI_Alltoallw is another version, which considers the offsets in bytes only.
The void version does not need to be templated, but could you add some short documentation to it?
How the offset is interpreted depends on the MPI_Datatype; I think you can read up on that somewhere in the MPI standard. So the offset is not in bytes, but in units depending on the datatype.
All of these functions are only wrappers for the corresponding MPI functions, so I don't think we should add documentation here. The only useful documentation I could think of is a reference to the MPI standard, but that seems superfluous.
TBH, the MPI documentation is not clear to me. At least MPICH only specifies that MPI_Alltoallw is in bytes, and doesn't mention anything specific for MPI_Alltoallv. Without looking into MPI_Alltoallw, I would not have guessed that MPI_Alltoallv might use element-wise offsets.
Okay, I found clearer documentation in Open MPI and RookieHPC, unlike MPICH: they specify that it is in units of sendtype.
I mean in the param doc, not the function description: something like "the element-wise offsets for the send buffer" or "the offsets (in units of send_type) for the send buffer" for send_offsets in the void version.
If you think it is still confusing or might not match the MPI documentation, you can keep the original version.
We don't explain the MPI functions here, and I don't think it would be our job to do so. One option would be to remove the parameter descriptions on all of our MPI wrappers and just keep the reference to the MPI standard. Would you be on board with that?
I already elaborated a bit on that on Slack, but I think this function should at the very least have a different name, and preferably be more specialized to allow making it type-safe instead of void*-typed, resolving the whole offset question.
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
This is necessary (at least for the stopping status) because in MPI runs some processes may have zero local dofs, and thus the initialization would be skipped.
On two different systems, using Open MPI 4.0.x results in deadlocks in our distributed solver tests. For versions 4.1.[34] the deadlock disappears, and Intel MPI and MVAPICH2 also don't show a deadlock.
- documentation
- set random engine seed explicitly
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
- disables test with UNIX calls on windows
- removes wrong documentation
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Pratik Nayak <pratik.nayak@kit.edu>
- explicitly request MPI version 3.1
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Terry Cojean <terry.cojean@kit.edu> Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Note: This PR changes the Ginkgo ABI. For details, check the full ABI diff under Artifacts here.
Kudos, SonarCloud Quality Gate passed!
Codecov Report
Base: 90.87% // Head: 91.98% // Increases project coverage by +1.10%.

@@           Coverage Diff            @@
##           develop    #1133   +/-   ##
===========================================
+ Coverage    90.87%   91.98%   +1.10%
===========================================
  Files          508      535      +27
  Lines        44294    46228    +1934
===========================================
+ Hits         40254    42524    +2270
+ Misses        4040     3704     -336
This PR will add basic distributed data structures (matrix and vector) and enable some solvers for these types. This PR contains the following PRs:
Changes
- experimental namespace
- make the generic_scoped_device_id_guard destructor noexcept by terminating if restoring the original device id fails