Distributed vector #961
Conversation
format-rebase!
Error: Rebase failed, see the related Action for details
format!
A first look over the code. The class structure and implementation look good to me; I haven't checked the tests yet.
A fundamental discussion point I would like to raise is whether Vector should store Partition. In my original design, both Matrix and Vector exclusively used local indexing after they have been filled, and I believe this is a property we could use generally. If users need to translate global indices and local indices, they have the original partition available from their setup code. Using the vector and matrix as a convenient vehicle to carry them around seems like it might overload the functionality a bit.
Regarding the partition, my use case for that was repartitioning, which we should support in the long term. This could be achieved without storing the partition in the vector class, since the repartitioning would need to store it anyway. However, I don't see how we could make sure that the vector we are repartitioning is valid for that operation if the vector does not store its partition. I think the additional safety we get from storing it is worth the downside. (I'm not entirely sure what you see as a downside, except that it is not necessary; perhaps you could elaborate on that?) Also, I think there is a strong connection between vector and partition: the vector is only valid in combination with this one partition, and storing it expresses that connection.
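To make the two designs concrete, here is a minimal, purely illustrative C++ sketch. The type and member names (`Partition`, `VectorWithPartition`, `VectorLocalOnly`, `repartition`) are made up for this comparison and are not the actual Ginkgo classes.

```cpp
// Illustrative only: contrasts the two designs discussed above.
// None of these names correspond to real Ginkgo types.
#include <memory>
#include <utility>
#include <vector>

struct Partition {};  // stand-in for the real partition type

// Design A: the vector keeps a handle to the partition it was built with,
// so operations like repartitioning can validate their input.
struct VectorWithPartition {
    std::vector<double> local_values;
    std::shared_ptr<const Partition> partition;
};

// Design B: the vector stores only local data; the application keeps the
// partition from its setup code and passes it in wherever it is needed.
struct VectorLocalOnly {
    std::vector<double> local_values;
};

void repartition(VectorWithPartition& vec,
                 std::shared_ptr<const Partition> new_partition)
{
    // With Design A we could check here that vec was built with the
    // partition the caller claims; with Design B that check cannot happen
    // inside this function, because the vector carries no partition.
    vec.partition = std::move(new_partition);
}
```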
Mostly looks good. Some minor questions on function interfaces.
Regarding the base class to merge this into: I think storing a partition in Vector also allows us to perform Vector-only operations on data (with a different partition), without access to the matrix or other partition-storing classes. Therefore, I also think it would be useful to store the partition in the vector class.
I've incorporated most of the review comments. There are two open questions so far (1 minor, 1 major):
For 2., the arguments against (as I understand it) are:
The arguments in favor are:
I've added the
Mostly looks good. Some comments on whether we want to make DenseCache public. Regarding the discussion points:
- I think the constructor order is okay.
- I think a distributed object should always have a partition member.
- It should be something like a Concept/`requires` clause for distributed objects (see the sketch below).
- There are cases for vector operations where the partition makes sense, for example dot products between two vectors.
- Additionally, the read function already takes a partition. In my opinion, it might make more sense to not have the read function take a partition, but to create the object with a partition and have the read function use the object's partition.
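A minimal sketch of how such a concept could look, assuming C++20 (Ginkgo itself targets C++14, so this is only to illustrate the idea). The `get_partition()` accessor and the `Partition` placeholder are assumptions, not the actual interface.

```cpp
// Illustrative C++20 concept for "a distributed object carries its partition".
// Names are placeholders, not the actual Ginkgo interface.
#include <concepts>
#include <memory>

struct Partition {};  // stand-in for the real partition type

template <typename T>
concept DistributedObject = requires(const T& obj) {
    // every distributed object exposes the partition it was built with
    {
        obj.get_partition()
    } -> std::convertible_to<std::shared_ptr<const Partition>>;
};
```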
@pratikvn you raise a very important point. The current implementations are only correct if both vectors (think axpy, dot, ...) use the same partition. It is not enough that both local vectors have the same size. This is something that we should check for at some point.
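As a rough illustration of such a check (hypothetical names, not the real API; in particular `get_partition()` and `dot()` are placeholders):

```cpp
// Hypothetical sketch: guard binary vector operations against mismatched
// partitions before doing any work. Placeholder interface, not the real API.
#include <memory>
#include <stdexcept>

template <typename Vector>
void check_same_partition(const Vector& a, const Vector& b)
{
    // strictest variant: both vectors must have been created from the same
    // partition object; matching local sizes alone would not be enough
    if (a.get_partition() != b.get_partition()) {
        throw std::invalid_argument("vectors use different partitions");
    }
}

template <typename Vector>
auto compute_dot(const Vector& a, const Vector& b)
{
    check_same_partition(a, b);
    return a.dot(b);  // local dot plus the reduction happens inside
}
```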
I see, there seems to be a very fundamental difference between our approaches. In my original implementation, neither Matrix nor Vector store their partition. If you ignore the off-diagonal entries, you can imagine the matrix as a block-diagonal matrix with local entries in each row. I agree that having the partition available is useful in some cases, but in that case the current implementation has similar issues to what #753 is fixing: copying the vector to another executor doesn't copy the Partition, which leads to issues when trying to read it from a kernel. A very basic rule I try to follow in my designs is to keep the amount of state in each object as minimal as possible, and to always prefer members with value semantics. This leads to the fewest weird empty states for objects. For
While I agree that minimizing state stored by a class is a good idea, I think it is important to have enough information to enable the class to be independent, without having to rely on other classes to provide valid information on how the data is being stored in this class. To me, a partition is crucial to a distributed object and provides information on the object's distribution. Regarding the issues you raise:
I tend to disagree with the premise that a vector needs to contain information that allows you to understand what its row indices mean. Take Dense, for example: without context (e.g. a FEM mesh), you don't know how to interpret the different entries. That context (mapping mesh entries to DOFs) is provided by the application and not stored in the vector. I see the distributed vector in the same way: the application provides a way to interpret what the local indices mean, based on its FEM mesh and partition.

Also, a side note we haven't talked about: for the most efficient way to store a partition (i.e. one contiguous range for each rank), the local size is entirely sufficient to check for consistency (assuming the default MPI configuration to quit if one rank aborts). Instead of checking the global property "matching partition" everywhere, you check the local property "matching size" everywhere, which is equivalent, because in this case, the

Finally, I may not have made the fine distinction between our current approach and L/T vectors clear enough:
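One way to picture the "matching size" argument above, as a minimal sketch assuming one contiguous range per rank and MPI's default error handler (`MPI_ERRORS_ARE_FATAL`), so that one rank aborting terminates the whole job:

```cpp
// Sketch: for contiguous one-range-per-rank partitions, the partition is
// fully determined by the per-rank local sizes, so a purely local size
// comparison on every rank is equivalent to a global partition comparison.
#include <mpi.h>
#include <cstdint>
#include <cstdlib>

void check_matching_size(MPI_Comm comm, std::int64_t local_size_a,
                         std::int64_t local_size_b)
{
    // no communication needed: a local mismatch implies a global mismatch
    if (local_size_a != local_size_b) {
        // with the default MPI_ERRORS_ARE_FATAL handler, any rank aborting
        // brings down the whole job, making the local check globally visible
        MPI_Abort(comm, EXIT_FAILURE);
    }
}
```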
EDIT: As a final thought, generally, it is always easier to add something later (store the partition after

EDIT2: I just remembered another issue: if we store the partition in Vector and Matrix, we also need to propagate it to all intermediate vectors in solvers, as well as distributed preconditioner wrappers. That functionality might impact a lot of classes with distributed-specific code.

EDIT3: After some discussions with Natalie, I believe the CEED case is pretty FEM-specific, since each nonzero involving the boundary of an element is computed from the sum of multiple submatrices involving the neighboring elements (see "E vector"), and such a decomposition into individual submatrices may not exist/be well-defined in the general case?
- small rename - documentation - cmake - tests Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com> Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
- formatting - fix DenseCache::init_from - fix tests if comm.size != 3 Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
this adds in turn mutable access through get_local_values and at_local Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Note: This PR changes the Ginkgo ABI:
For details check the full ABI diff under Artifacts here
Kudos, SonarCloud Quality Gate passed!
This PR adds support for a (row-wise) distributed (multi-)vector. It supports most operations of the dense class. These vector operations are supported on all devices that support the corresponding dense operation; only the initialization through `read_distributed` is limited to the reference and OpenMP backends. Related PRs: #961, #1030
This PR will enable using distributed matrices and vectors (#971 and #961) in the following iterative solvers:
- Bicgstab
- Cg
- Cgs
- Fcg
- Ir

Currently not supported are:
- Bicg
- [cb_]Gmres
- Idr
- Multigrid
- Lower/Upper_trs

The handling of the distributed/non-distributed data is done via additional dispatch routines that expand on precision_dispatch_real_complex, and helper routines to extract the underlying dense matrix from either a distributed or dense vector. Also, the residual norm stopping criteria implementation has been changed to use a similar dispatch approach. This also contains some fixes to the doxygen documentation of the other distributed classes. Related PR: #976
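A rough sketch of the extraction idea mentioned there. The accessor `get_local_vector()` and the `experimental::distributed` namespace are taken from later Ginkgo releases and may not match the state of this PR, so treat the details as assumptions.

```cpp
// Sketch: obtain the process-local dense data from either a distributed
// vector or a plain Dense matrix. Requires Ginkgo built with MPI support;
// the accessor name is an assumption based on later releases.
#include <ginkgo/ginkgo.hpp>

template <typename ValueType>
const gko::matrix::Dense<ValueType>* get_local_dense(const gko::LinOp* op)
{
    using DistVec = gko::experimental::distributed::Vector<ValueType>;
    if (auto dist = dynamic_cast<const DistVec*>(op)) {
        // distributed case: use the local Dense block owned by this rank
        return dist->get_local_vector();
    }
    // non-distributed case: the LinOp is already a Dense matrix
    return gko::as<gko::matrix::Dense<ValueType>>(op);
}
```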
This PR will add basic, distributed data structures (matrix and vector), and enable some solvers for these types. This PR contains the following PRs:
- #961
- #971
- #976
- #985
- #1007
- #1030
- #1054

# Additional Changes
- moves new types into experimental namespace
- moves existing Partition class into experimental namespace
- moves existing mpi namespace into experimental namespace
- makes generic_scoped_device_id_guard destructor noexcept by terminating if restoring the original device id fails
- switches to blocking communication in the SpMV if OpenMPI version 4.0.x is used
- disables Horeka mpi tests and uses nla-gpu instead

Related PR: #1133
Advertise release 1.5.0 and last changes
+ Add changelog
+ Update third party libraries
+ A small fix to a CMake file

See PR: #1195

The Ginkgo team is proud to announce the new Ginkgo minor release 1.5.0. This release brings many important new features such as:
- MPI-based multi-node support for all matrix formats and most solvers;
- full DPC++/SYCL support,
- functionality and interface for GPU-resident sparse direct solvers,
- an interface for wrapping solvers with scaling and reordering applied,
- a new algebraic Multigrid solver/preconditioner,
- improved mixed-precision support,
- support for device matrix assembly, and much more.

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.13+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CUDA 9.2+ or NVHPC 22.7+
  + HIP module: ROCm 4.0+
  + DPC++ module: Intel OneAPI 2021.3 with oneMKL and oneDPL. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: GCC 5.5+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.2+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:
+ Add MPI-based multi-node for all matrix formats and solvers (except GMRES and IDR). ([#676](#676), [#908](#908), [#909](#909), [#932](#932), [#951](#951), [#961](#961), [#971](#971), [#976](#976), [#985](#985), [#1007](#1007), [#1030](#1030), [#1054](#1054), [#1100](#1100), [#1148](#1148))
+ Porting the remaining algorithms (preconditioners like ISAI, Jacobi, Multigrid, ParILU(T) and ParIC(T)) to DPC++/SYCL, update to SYCL 2020, and improve support and performance ([#896](#896), [#924](#924), [#928](#928), [#929](#929), [#933](#933), [#943](#943), [#960](#960), [#1057](#1057), [#1110](#1110), [#1142](#1142))
+ Add a Sparse Direct interface supporting GPU-resident numerical LU factorization, symbolic Cholesky factorization, improved triangular solvers, and more ([#957](#957), [#1058](#1058), [#1072](#1072), [#1082](#1082))
+ Add a ScaleReordered interface that can wrap solvers and automatically apply reorderings and scalings ([#1059](#1059))
+ Add a Multigrid solver and improve the aggregation based PGM coarsening scheme ([#542](#542), [#913](#913), [#980](#980), [#982](#982), [#986](#986))
+ Add infrastructure for unified, lambda-based, backend agnostic, kernels and utilize it for some simple kernels ([#833](#833), [#910](#910), [#926](#926))
+ Merge different CUDA, HIP, DPC++ and OpenMP tests under a common interface ([#904](#904), [#973](#973), [#1044](#1044), [#1117](#1117))
+ Add a device_matrix_data type for device-side matrix assembly ([#886](#886), [#963](#963), [#965](#965))
+ Add support for mixed real/complex BLAS operations ([#864](#864))
+ Add a FFT LinOp for all but DPC++/SYCL ([#701](#701))
+ Add FBCSR support for NVIDIA and AMD GPUs and CPUs with OpenMP ([#775](#775))
+ Add CSR scaling ([#848](#848))
+ Add array::const_view and equivalent to create constant matrices from non-const data ([#890](#890))
+ Add a RowGatherer LinOp supporting mixed precision to gather dense matrix rows ([#901](#901))
+ Add mixed precision SparsityCsr SpMV support ([#970](#970))
+ Allow creating CSR submatrix including from (possibly discontinuous) index sets ([#885](#885), [#964](#964))
+ Add a scaled identity addition (M <- aI + bM) feature interface and impls for Csr and Dense ([#942](#942))

Deprecations and important changes:
+ Deprecate AmgxPgm in favor of the new Pgm name ([#1149](#1149))
+ Deprecate specialized residual norm classes in favor of a common `ResidualNorm` class ([#1101](#1101))
+ Deprecate CamelCase non-polymorphic types in favor of snake_case versions (like array, machine_topology, uninitialized_array, index_set) ([#1031](#1031), [#1052](#1052))
+ Bug fix: restrict gko::share to rvalue references (*possible interface break*) ([#1020](#1020))
+ Bug fix: when using cuSPARSE's triangular solvers, specifying the factory parameter `num_rhs` is now required when solving for more than one right-hand side, otherwise an exception is thrown ([#1184](#1184))
+ Drop official support for old CUDA < 9.2 ([#887](#887))

Improved performance additions:
+ Reuse tmp storage in reductions in solvers and add a mutable workspace to all solvers ([#1013](#1013), [#1028](#1028))
+ Add HIP unsafe atomic option for AMD ([#1091](#1091))
+ Prefer vendor implementations for Dense dot, conj_dot and norm2 when available ([#967](#967))
+ Tuned OpenMP SellP, COO, and ELL SpMV kernels for a small number of RHS ([#809](#809))

Fixes:
+ Fix various compilation warnings ([#1076](#1076), [#1183](#1183), [#1189](#1189))
+ Fix issues with hwloc-related tests ([#1074](#1074))
+ Fix include headers for GCC 12 ([#1071](#1071))
+ Fix for simple-solver-logging example ([#1066](#1066))
+ Fix for potential memory leak in Logger ([#1056](#1056))
+ Fix logging of mixin classes ([#1037](#1037))
+ Improve value semantics for LinOp types, like moved-from state in cross-executor copy/clones ([#753](#753))
+ Fix some matrix SpMV and conversion corner cases ([#905](#905), [#978](#978))
+ Fix uninitialized data ([#958](#958))
+ Fix CUDA version requirement for cusparseSpSM ([#953](#953))
+ Fix several issues within bash-script ([#1016](#1016))
+ Fixes for `NVHPC` compiler support ([#1194](#1194))

Other additions:
+ Simplify and properly name GMRES kernels ([#861](#861))
+ Improve pkg-config support for non-CMake libraries ([#923](#923), [#1109](#1109))
+ Improve gdb pretty printer ([#987](#987), [#1114](#1114))
+ Add a logger highlighting inefficient allocation and copy patterns ([#1035](#1035))
+ Improved and optimized test random matrix generation ([#954](#954), [#1032](#1032))
+ Better CSR strategy defaults ([#969](#969))
+ Add `move_from` to `PolymorphicObject` ([#997](#997))
+ Remove unnecessary device_guard usage ([#956](#956))
+ Improvements to the generic accessor for mixed-precision ([#727](#727))
+ Add a naive lower triangular solver implementation for CUDA ([#764](#764))
+ Add support for int64 indices from CUDA 11 onward with SpMV and SpGEMM ([#897](#897))
+ Add a L1 norm implementation ([#900](#900))
+ Add reduce_add for arrays ([#831](#831))
+ Add utility to simplify Dense View creation from an existing Dense vector ([#1136](#1136))
+ Add a custom transpose implementation for Fbcsr and Csr transpose for unsupported vendor types ([#1123](#1123))
+ Make IDR random initialization deterministic ([#1116](#1116))
+ Move the algorithm choice for triangular solvers from Csr::strategy_type to a factory parameter ([#1088](#1088))
+ Update CUDA archCoresPerSM ([#1175](#1175))
+ Add kernels for Csr sparsity pattern lookup ([#994](#994))
+ Differentiate between structural and numerical zeros in Ell/Sellp ([#1027](#1027))
+ Add a binary IO format for matrix data ([#984](#984))
+ Add a tuple zip_iterator implementation ([#966](#966))
+ Simplify kernel stubs and declarations ([#888](#888))
+ Simplify GKO_REGISTER_OPERATION with lambdas ([#859](#859))
+ Simplify copy to device in tests and examples ([#863](#863))
+ More verbose output to array assertions ([#858](#858))
+ Allow parallel compilation for Jacobi kernels ([#871](#871))
+ Change clang-format pointer alignment to left ([#872](#872))
+ Various improvements and fixes to the benchmarking framework ([#750](#750), [#759](#759), [#870](#870), [#911](#911), [#1033](#1033), [#1137](#1137))
+ Various documentation improvements ([#892](#892), [#921](#921), [#950](#950), [#977](#977), [#1021](#1021), [#1068](#1068), [#1069](#1069), [#1080](#1080), [#1081](#1081), [#1108](#1108), [#1153](#1153), [#1154](#1154))
+ Various CI improvements ([#868](#868), [#874](#874), [#884](#884), [#889](#889), [#899](#899), [#903](#903), [#922](#922), [#925](#925), [#930](#930), [#936](#936), [#937](#937), [#958](#958), [#882](#882), [#1011](#1011), [#1015](#1015), [#989](#989), [#1039](#1039), [#1042](#1042), [#1067](#1067), [#1073](#1073), [#1075](#1075), [#1083](#1083), [#1084](#1084), [#1085](#1085), [#1139](#1139), [#1178](#1178), [#1187](#1187))
This PR adds a distributed (multi-)vector class. I've opted to use mostly multivector-based descriptions instead of dense ones. This might clash a bit with the rest, until we separate Dense into multivector and dense matrix.
The vector is a distributed global vector, i.e. it has the size `global_num_rows x num_cols`. Only the rows are distributed. Each process stores only the rows that belong to it in a `Dense` matrix, according to the used partition.

Regarding the tests, I've put the tests that require MPI into `/core/test/mpi/distributed`, not sure if that is the best place.

Partially addresses #907
Note: should we use `distributed-develop` as the target branch again? That way we could make sure that we have all basic things (vectors, matrices, solvers) together for the next release.

Main contributions are from @upsj and @pratikvn.
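For illustration, a rough usage sketch of the workflow described above: build a contiguous row partition, then read a distributed vector on the reference executor (where `read_distributed` is implemented). The namespaces follow the later move into `experimental` (#1133) and the exact factory names may differ from the state of this PR, so treat this as a sketch rather than the authoritative API.

```cpp
// Sketch only: names follow later Ginkgo releases and may differ here.
#include <ginkgo/ginkgo.hpp>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    {
        gko::experimental::mpi::communicator comm(MPI_COMM_WORLD);
        // read_distributed is only implemented on reference and OpenMP
        auto exec = gko::ReferenceExecutor::create();

        // global vector of size global_num_rows x 1, rows split into one
        // contiguous range per rank
        const gko::size_type global_num_rows = 100;
        auto partition = gko::experimental::distributed::
            Partition<gko::int32, gko::int64>::build_from_global_size_uniform(
                exec, comm.size(), global_num_rows);

        // every rank provides (its part of) the global data
        gko::matrix_data<double, gko::int64> data{
            gko::dim<2>{global_num_rows, 1}};
        // ... fill data.nonzeros with (global_row, 0, value) entries ...

        // keeps only the rows owned by this rank, stored in a local Dense
        // matrix according to the partition
        auto vec =
            gko::experimental::distributed::Vector<double>::create(exec, comm);
        vec->read_distributed(data, partition.get());
    }
    MPI_Finalize();
}
```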