Add a MPI layer, some bindings, CMake setup and MPI test framework #908
Conversation
I think there are a few areas that need some more discussion before we can merge this, namely:
- ginkgo_mpi as a separate shared library probably won't work consistently, especially with the deep integration into solvers.
- declaring MPI functionality in headers with GKO_BUILD_MPI could lead to a lot of issues with binary incompatibility; I think we should consider including these files/their contents only conditionally. For an example, check the distributed-matrix branch.
- personally I prefer to have the wrappers available as function templates in the headers, since they can then be easily extended to custom types
Codecov Report

```diff
@@        Coverage Diff                 @@
## distributed-develop    #908    +/- ##
=========================================
- Coverage    94.39%    93.30%    -1.09%
=========================================
  Files          453       459        +6
  Lines        36907     37578      +671
=========================================
+ Hits         34838     35063      +225
- Misses        2069      2515      +446
```
(force-pushed: bfe4bbf → ae22432)
@upsj, declaring files conditionally has some side effects:
Can you elaborate on what you mean by possible binary incompatibilities?
I am not sure what else we can do here. I think putting the MPI symbols into ginkgo_core conditionally is not good either, and I don't think we can put it anywhere else. What kind of problems do you foresee with ginkgo_mpi as a separate shared library?
With the current approach as well, the user-facing wrappers are templated. This should allow for easy extension to custom types, with automatic mpi_type deduction and unwrapping of the data of custom types. Also, extension to custom types usually needs specific overloads, for the reason that you usually don't need to specify counts, which you need to for the POD types, so the function signature needs to be different in my opinion.
@pratikvn yes, unfortunately I see no nice solution to this dilemma: distributed matrix and vector need MPI functionality, i.e. they need to store an MPI communicator, i.e. the type of MPI_Comm must be known in the headers. If you compile Ginkgo without MPI, then the corresponding headers are empty thanks to the GKO_BUILD_MPI guard. If you tried to pass a distributed vector (which you got by hacking around in the types) to a non-distributed solver, you'd simply get a runtime error, since it expects a non-distributed Dense vector.
It sounds to me like you are talking about the function declarations here, but without a function definition available in the header, you as a user cannot instantiate the function for custom types, even if they had nicely corresponding MPI types.
The issue is at the interaction with solvers - we need three things: a dispatch that takes care of the distinction between Dense and distributed Vector, an unpacking function that returns the underlying local Dense vector for kernel calls, and a reduction function that can be used to allreduce the result of custom reductions like in CGS or GMRES. All of them need MPI functionality, so fundamental parts of the distributed integration could not be part of ginkgo_core. Don't get me wrong, it is quite possible that there is a clean solution to implement this, but so far I haven't seen anything close to it that doesn't require large amounts of boilerplate code, or complicates user interaction.
Maybe I need to clarify what I meant. We need to decide on a couple of things which will probably push us towards one decision or the other.
```cpp
std::ifstream stream("fname");
auto data = gko::read_raw<>(stream);
auto mat = gko::matrix::Csr<>::create(exec);  // single exec
auto dist_mat = gko::distributed::Matrix<Csr<>>::create(exec, comm);  // distributed object
auto partition = gko::Partition::build(exec, ranges);  // build partition
if (is_distributed()) {  // some way to say distributed functionality has been compiled for
    dist_mat->read_distributed(data, partition);
} else {
    mat->read(data);
}
```
If we do, then we need a dummy implementation. This needs some input from @tcojean and @fritzgoebel, as they have some experience/input from the openCARP side.
```cpp
// Some core algorithm
// Do common stuff with LinOps (both non-distributed and distributed have that as base type)
if (is_distributed()) {
    // Call MPI functions or work on distributed objects with dynamic_casts.
} else {
    // Do single-executor stuff.
}
// Do common stuff with LinOps (both non-distributed and distributed have that as base type)
```

In other words, can we guarantee that the core algorithm will be the same for non-distributed and distributed for all current and future algorithms? I think this would be a pretty strict restriction. Both of the above would also work with the GKO_BUILD_MPI option.
I'm not in a position yet to give much feedback on all those issues, but I think the …
To collect the summary of our discussions and decisions:
@upsj and @MarcelKoch, please feel free to correct me if I am missing something.
Just to clarify, the executor-specific kernels of some distributed classes/functions will still be compiled into the corresponding device library regardless of the MPI status, since these kernels do not require MPI in any way.
(force-pushed: 909ff7b → 5fd2863)
LGTM, thanks for adding all of this, especially the CI. I have some mostly smaller remarks below. Larger issues, like using value semantics for most MPI wrapper classes or using fully templated functions, can be addressed at a later point.
In include/ginkgo/core/base/mpi.hpp (outdated):

```cpp
 * for our purposes. As the class or object goes out of scope, the communicator
 * is freed.
 */
class communicator : public EnableSharedCreateMethod<communicator> {
```
One slight issue with using value semantics is that it requires a call to MPI_Comm_dup, which is a collective operation, so the overhead would probably increase.
LGTM! Nice job with the communicator wrapper. I noticed one mistake in the tests, and there are other relatively minor comments.
The communicator implementation looks good for the owning case; the non-owning case and its interaction with owning communicators still has a few rough edges to iron out. They are mainly visible in the assignment operators, but also to a lesser degree in the constructors:

Copy-assignment `a = b`:

| | a owning | a non-owning |
|---|---|---|
| b owning | duplicate | duplicate |
| b non-owning | duplicate? | duplicate? |

Move-assignment `a = std::move(b)` should never duplicate, but should b be empty afterwards if it was non-owning? Should a be turned from owning to non-owning by this operation?
Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu> Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Pratik and I discussed this behavior conundrum and, following the design of Boost.MPI, came to the conclusion that it makes sense to take care of conditional ownership + duplication only at the application-Ginkgo boundary, i.e. the communicator constructor, and to store the communicator in a shared_ptr with a custom deleter, to avoid unnecessary duplication if we have many distributed objects sharing a communicator.
What are the implications of this? What I understand from this is that you would then:
Is there something I misunderstood?
Yes, basically this means that there are two states for the shared_ptr<MPI_Comm> internally: ones with a plain deleter (non-owning) that just deletes the MPI_Comm pointer, and ones with a deleter that calls MPI_Comm_free and deletes the pointer afterwards. This state is kept during copy construction and assignment (move construction and assignment are disabled/handled by copy construction and assignment).
LGTM! Just a few remaining small nits:
- Formatting: we usually use a single empty line between member functions; that's a bit inconsistent in the file right now
- Caching rank/size/local rank inside the communicator
- is_gpu_aware is vendor-specific, since we can now compile CUDA and ROCm simultaneously, and maybe DPC++ in the future as well?
- you removed the request type; I think that could still be really useful, with appropriate member functions .wait() and .test(), and maybe static member functions wait_all() and test_all()?
```cpp
GKO_ASSERT(*comm != MPI_COMM_NULL);
GKO_ASSERT_NO_MPI_ERRORS(MPI_Comm_free(comm));
```
Just a note, not necessarily something we need to change: This throws/aborts inside a destructor, which may not be ideal. MPI_Comm_free probably also checks against MPI_COMM_NULL (considering how eager the MPI impls I checked are to abort on slight issues)
Suggested change:

```diff
-GKO_ASSERT(*comm != MPI_COMM_NULL);
-GKO_ASSERT_NO_MPI_ERRORS(MPI_Comm_free(comm));
+if (MPI_Comm_free(comm) != MPI_SUCCESS) {
+    // something like in CudaExecutor::raw_free
+}
```
LGTM, thanks for the excellent work. I have only a minor remark on the constness of the member functions. Nearly every member function of communicator and window can be marked const. The underlying MPI object is always copied, and I think very few calls are documented to change the MPI object.
Also, I second @upsj's suggestion to add a request type, to really make it consistent.
Co-authored-by: Tobias Ribizel <ribizel@kit.edu> Co-authored-by: Marcel Koch <marcel.koch@kit.edu> Co-authored-by: Aditya Kashi <aditya.kashi@kit.edu>
(force-pushed: 2c6219e → 37b8a44)
Kudos, SonarCloud Quality Gate passed!
Advertise release 1.5.0 and last changes: add changelog, update third party libraries, and a small fix to a CMake file. See PR: #1195. Among many other features, the 1.5.0 release brings the MPI-based multi-node support for all matrix formats and most solvers developed in this PR and its follow-ups.
This PR aims to add the MPI base framework to Ginkgo in an experimental distributed-develop branch. Partially closes #907.