Add distributed capabilities #1133
Conversation
Do all commits belong to this pull request?
@yhmtsai I would say only the recent ones that were not covered in the earlier PRs need to be reviewed, so only the commits after the last merge commit.
env({{"MV2_COMM_WORLD_LOCAL_RANK", ""},
     {"OMPI_COMM_WORLD_LOCAL_RANK", ""},
     {"MPI_LOCALRANKID", ""},
     {"SLURM_LOCALID", ""}})
Does changing the environment after MPI_Init have an effect during the program?
It might. In each test I set the env variable explicitly before I call the mapping function, so if the env var is changed in between, the test might fail. But I think that shouldn't be an issue for our containers.
Also, I reset the environment to what it was at the time SetUp is called, so it won't be reset to the changes made in between.
 * @param recv_type the MPI_Datatype for the recv buffer
 * @param comm the communicator
 */
void all_to_all_v(std::shared_ptr<const Executor> exec,
I think it should be like the version above, using SendType and RecvType, and also scale the offsets by the type.
Could you expand on in which way this should be like the templated one? This one is needed for the custom MPI_Datatype to handle multiple right-hand sides.
If the MPI_Datatype is 8 bytes, will send_buffer+1 or send_buffer+2 (or the send_offsets 1 or 2) access send[1]?
I think that is an MPI detail that we don't need to care about.
I think we do need to, because this function takes the offsets here.
We have two versions for that:
- with the templated type -> it adds the MPI type and then passes it on to the following option
- with the void type
From my view of this interface, I would pass the offsets like (1, 2, 3) for the templated version no matter what the type is, because it takes care of the type. However, I would consider passing the offsets like (2, 4, 6) when the type size is 8 bytes for the void version, because I do not know whether the function takes care of the type size for the offsets.
After reading some documentation and the implementation, it seems to take care of the type size. MPI_Alltoallw is another version, which considers the offsets in bytes only.
The void version does not need to be templated, but could you add some short documentation to it?
How the offset is interpreted depends on the MPI_Datatype; I think you can read up on that somewhere in the MPI standard. So the offset is not in bytes, but in units depending on the datatype.
All of these functions are only wrappers for the corresponding MPI functions, so I don't think we should add documentation here. The only useful documentation I could think of is a reference to the MPI standard, but that seems superfluous.
TBH, the MPI documentation is not clear to me. At least MPICH only specifies that MPI_Alltoallw is in bytes, and doesn't mention anything specific for MPI_Alltoallv. Without looking into MPI_Alltoallw, I would not have guessed that MPI_Alltoallv might use element-wise offsets.
Okay, I found clearer documentation in Open MPI and RookieHPC, unlike MPICH: they specify that it is in units of sendtype.
I mean in the param doc, not the function description: something like "the element-wise offsets for the send buffer" or "the offsets (in units of send_type) for the send buffer" for send_offsets in the void version.
If you think it is still confusing or might not match the MPI documentation, you can keep the original version.
We don't explain the MPI functions here, and I don't think it would be our job to do so. One option would be to remove the parameter descriptions on all of our MPI wrappers and just keep the reference to the MPI standard. Would you be on board with that?
I already elaborated a bit on that on Slack, but I think this function should at the very least have a different name, and preferably be more specialized to allow making it type-safe instead of void*-typed, resolving the whole offset question.
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
This is necessary (at least for the stopping status) because in MPI runs some processes may have zero local dofs, and thus the initialization would be skipped.
On two different systems, using Open MPI 4.0.x results in deadlocks in our distributed solver tests. For versions 4.1.[34] the deadlock disappears, and Intel MPI and MVAPICH2 also don't show a deadlock.
- documentation
- set random engine seed explicitly
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
- disables test with UNIX calls on windows
- removes wrong documentation
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Co-authored-by: Pratik Nayak <pratik.nayak@kit.edu>
- explicitly request MPI version 3.1
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Terry Cojean <terry.cojean@kit.edu> Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Note: This PR changes the Ginkgo ABI. For details, check the full ABI diff under Artifacts here.
Kudos, SonarCloud Quality Gate passed!
Codecov Report
Base: 90.87% // Head: 91.98% // Increases project coverage by +1.10%.

@@           Coverage Diff            @@
##           develop    #1133   +/-   ##
===========================================
+ Coverage    90.87%   91.98%   +1.10%
===========================================
  Files          508      535      +27
  Lines        44294    46228    +1934
===========================================
+ Hits         40254    42524    +2270
+ Misses        4040     3704     -336
This PR will add basic distributed data structures (matrix and vector) and enable some solvers for these types. This PR contains the following PRs:
Changes
- experimental namespace
- make the generic_scoped_device_id_guard destructor noexcept by terminating if restoring the original device id fails