Add Profiler annotation logger #1055

upsj · 2022-06-17T17:29:57Z

The logger adds timers for all important Ginkgo events based on the perfstubs library, VTune, nvTX and rocTX. To make it work, we also need to forward all logger events to the executor. This is meant to show how we can annotate a profile with Ginkgo-specific information.

An example profile from 10000 iterations in preconditioned-solver.cpp.
Not sure what's going on with spmv#3, but this may be related to inclusive vs. exclusive timing.

Below, you can see the same example running with NSight Systems and NVTX ranges.

tcojean

LGTM.

They also have this dynamic phase thing, I don't know if we can use this for solver iterations or what it does?
https://github.com/UO-OACISS/perfstubs/blob/master/perfstubs_api/timer.h#L136

I'm also not too happy to propagate everything to the executor as that defeats the purpose of the split system we have as that will make users rely much more on event masks to get only a subset of the events, instead of being responsibility based. But that is more of a preference than a requirement.

examples/preconditioned-solver/preconditioned-solver.cpp

include/ginkgo/core/log/tau.hpp

examples/preconditioned-solver/preconditioned-solver.cpp

upsj · 2022-12-08T21:17:02Z

format-rebase!

ginkgo-bot · 2022-12-08T21:17:44Z

Formatting rebase introduced changes, see Artifacts here to review them

MarcelKoch

I think we should discuss, why this uses global loggers. As far as I can see, the logger itself does not to be a global object. Each hook just calls a function, so there is no need for global state. Only for Tau we need some global state, but that can be kept minimal.
So I think the global state is only used by the decision to propagate all loggers to the executor. Perhaps it's possible to achieve that without global state. Maybe we could specialize EnableLogging for Executor and add a counter of attached e.g. PropagatedLoggers to check if propagation is necessary.

core/log/profiler_hook.cpp

upsj · 2022-12-09T10:10:08Z

I'm very open to other solutions to the global initialization problem - we have a similar issue with things like MPI_Init/Finalize, where users may or may not want us to take care of initialization. I do like your suggestion a lot, I didn't consider letting the Executor decide whether to receive propagated events. We could potentially also extend that to handle the split between host and device executor, making the workaround for the benchmarks unnecessary? But that would make the host executor the destination for all events, which is usually not that easy to access compared to the device executor, which is used everywhere.

upsj · 2022-12-09T10:48:56Z

On the global object question: I didn't realize until now, in case somebody were to run some kind of weird hybrid solve, it might actually make sense to have multiple different loggers with different properties all providing NVTX ranges. I'll add a bit of color to the nvtx example :)

upsj · 2022-12-09T11:13:19Z

Not using singletons for nvtx and roctx definitely makes sense: We can assign different colors to different executors that way (with a bit of corporate branding 😄 ): Yellow is GPU, gray is CPU

cmake/Modules/FindROCTX.cmake

cmake/hip.cmake

include/ginkgo/core/log/logger.hpp

include/ginkgo/core/log/profiler_hook.hpp

examples/preconditioned-solver/preconditioned-solver.cpp

core/log/profiler_hook.cpp

upsj · 2022-12-09T22:08:03Z

Random thought: A future feature that would work especially well for NVTX would be allowing users to give names to objects: profiler_set_name(linop.get(), "mass_matrix") would probably make the profile much more readable, and be very cheap to implement

upsj · 2022-12-12T11:14:34Z

ROCm profiling also seems to work :)

the logger adds timers for all important Ginkgo events. this PR also forwards all logger events to the executor.

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>

* allow synchronization for debugging purposes * use singleton for TAU again * add Executor-dependent create function * add documentation * provide profiling scope guards for internal use * automatically determine whether to propagate * add tests

* improve comments * make TAU logger non-singleton Co-authored-by: Marcel Koch <marcel.koch@kit.edu>

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>

- improve documentation - add tests - use shared_ptr Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>

ginkgo-bot · 2023-02-02T11:28:49Z

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 0 Removed, 291 Changed (57151 filtered out), 306 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

sonarqubecloud · 2023-02-02T22:52:14Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
25 Code Smells

67.5% Coverage
3.1% Duplication

Release 1.6.0 of Ginkgo. The Ginkgo team is proud to announce the new Ginkgo minor release 1.6.0. This release brings new features such as: - Several building blocks for GPU-resident sparse direct solvers like symbolic and numerical LU and Cholesky factorization, ..., - A distributed Schwarz preconditioner, - New FGMRES and GCR solvers, - Distributed benchmarks for the SpMV operation, solvers, ... - Support for non-default streams in the CUDA and HIP backends, - Mixed precision support for the CSR SpMV, - A new profiling logger which integrates with NVTX, ROCTX, TAU and VTune to provide internal Ginkgo knowledge to most HPC profilers! and much more. If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions). Supported systems and requirements: + For all platforms, CMake 3.13+ + C++14 compliant compiler + Linux and macOS + GCC: 5.5+ + clang: 3.9+ + Intel compiler: 2018+ + Apple Clang: 14.0 is tested. Earlier versions might also work. + NVHPC: 22.7+ + Cray Compiler: 14.0.1+ + CUDA module: CUDA 9.2+ or NVHPC 22.7+ + HIP module: ROCm 4.5+ + DPC++ module: Intel OneAPI 2021.3+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp`. + Windows + MinGW: GCC 5.5+ + Microsoft Visual Studio: VS 2019+ + CUDA module: CUDA 9.2+, Microsoft Visual Studio + OpenMP module: MinGW. ### Version Support Changes + ROCm 4.0+ -> 4.5+ after [#1303](#1303) + Removed Cygwin pipeline and support [#1283](#1283) ### Interface Changes + Due to internal changes, `ConcreteExecutor::run` will now always throw if the corresponding module for the `ConcreteExecutor` is not build [#1234](#1234) + The constructor of `experimental::distributed::Vector` was changed to only accept local vectors as `std::unique_ptr` [#1284](#1284) + The default parameters for the `solver::MultiGrid` were improved. In particular, the smoother defaults to one iteration of `Ir` with `Jacobi` preconditioner, and the coarse grid solver uses the new direct solver with LU factorization. [#1291](#1291) [#1327](#1327) + The `iteration_complete` event gained a more expressive overload with additional parameters, the old overloads were deprecated. [#1288](#1288) [#1327](#1327) ### Deprecations + Deprecated less expressive `iteration_complete` event. Users are advised to now implement the function `void iteration_complete(const LinOp* solver, const LinOp* b, const LinOp* x, const size_type& it, const LinOp* r, const LinOp* tau, const LinOp* implicit_tau_sq, const array<stopping_status>* status, bool stopped)` [#1288](#1288) ### Added Features + A distributed Schwarz preconditioner. [#1248](#1248) + A GCR solver [#1239](#1239) + Flexible Gmres solver [#1244](#1244) + Enable Gmres solver for distributed matrices and vectors [#1201](#1201) + An example that uses Kokkos to assemble the system matrix [#1216](#1216) + A symbolic LU factorization allowing the `gko::experimental::factorization::Lu` and `gko::experimental::solver::Direct` classes to be used for matrices with non-symmetric sparsity pattern [#1210](#1210) + A numerical Cholesky factorization [#1215](#1215) + Symbolic factorizations in host-side operations are now wrapped in a host-side `Operation` to make their execution visible to loggers. This means that profiling loggers and benchmarks are no longer missing a separate entry for their runtime [#1232](#1232) + Symbolic factorization benchmark [#1302](#1302) + The `ProfilerHook` logger allows annotating the Ginkgo execution (apply, operations, ...) for profiling frameworks like NVTX, ROCTX and TAU. [#1055](#1055) + `ProfilerHook::created_(nested_)summary` allows the generation of a lightweight runtime profile over all Ginkgo functions written to a user-defined stream [#1270](#1270) for both host and device timing functionality [#1313](#1313) + It is now possible to enable host buffers for MPI communications at runtime even if the compile option `GINKGO_FORCE_GPU_AWARE_MPI` is set. [#1228](#1228) + A stencil matrices generator (5-pt, 7-pt, 9-pt, and 27-pt) for benchmarks [#1204](#1204) + Distributed benchmarks (multi-vector blas, SpMV, solver) [#1204](#1204) + Benchmarks for CSR sorting and lookup [#1219](#1219) + A timer for MPI benchmarks that reports the longest time [#1217](#1217) + A `timer_method=min|max|average|median` flag for benchmark timing summary [#1294](#1294) + Support for non-default streams in CUDA and HIP executors [#1236](#1236) + METIS integration for nested dissection reordering [#1296](#1296) + SuiteSparse AMD integration for fillin-reducing reordering [#1328](#1328) + Csr mixed-precision SpMV support [#1319](#1319) + A `with_loggers` function for all `Factory` parameters [#1337](#1337) ### Improvements + Improve naming of kernel operations for loggers [#1277](#1277) + Annotate solver iterations in `ProfilerHook` [#1290](#1290) + Allow using the profiler hooks and inline input strings in benchmarks [#1342](#1342) + Allow passing smart pointers in place of raw pointers to most matrix functions. This means that things like `vec->compute_norm2(x.get())` or `vec->compute_norm2(lend(x))` can be simplified to `vec->compute_norm2(x)` [#1279](#1279) [#1261](#1261) + Catch overflows in prefix sum operations, which makes Ginkgo's operations much less likely to crash. This also improves the performance of the prefix sum kernel [#1303](#1303) + Make the installed GinkgoConfig.cmake file relocatable and follow more best practices [#1325](#1325) ### Fixes + Fix OpenMPI version check [#1200](#1200) + Fix the mpi cxx type binding by c binding [#1306](#1306) + Fix runtime failures for one-sided MPI wrapper functions observed on some OpenMPI versions [#1249](#1249) + Disable thread pinning with GPU executors due to poor performance [#1230](#1230) + Fix hwloc version detection [#1266](#1266) + Fix PAPI detection in non-implicit include directories [#1268](#1268) + Fix PAPI support for newer PAPI versions: [#1321](#1321) + Fix pkg-config file generation for library paths outside prefix [#1271](#1271) + Fix various build failures with ROCm 5.4, CUDA 12, and OneAPI 6 [#1214](#1214), [#1235](#1235), [#1251](#1251) + Fix incorrect read for skew-symmetric MatrixMarket files with explicit diagonal entries [#1272](#1272) + Fix handling of missing diagonal entries in symbolic factorizations [#1263](#1263) + Fix segmentation fault in benchmark matrix construction [#1299](#1299) + Fix the stencil matrix creation for benchmarking [#1305](#1305) + Fix the additional residual check in IR [#1307](#1307) + Fix the cuSPARSE CSR SpMM issue on single strided vector when cuda >= 11.6 [#1322](#1322) [#1331](#1331) + Fix Isai generation for large sparsity powers [#1327](#1327) + Fix Ginkgo compilation and test with NVHPC >= 22.7 [#1331](#1331) + Fix Ginkgo compilation of 32 bit binaries with MSVC [#1349](#1349)

upsj added the 1:ST:need-feedback The PR is somewhat ready but feedback on a blocking topic is required before a proper review. label Jun 17, 2022

upsj requested review from tcojean and a team June 17, 2022 17:29

upsj self-assigned this Jun 17, 2022

ginkgo-bot added mod:core This is related to the core module. reg:build This is related to the build system. reg:example This is related to the examples. labels Jun 17, 2022

tcojean previously approved these changes Jul 12, 2022

View reviewed changes

upsj linked an issue Jul 13, 2022 that may be closed by this pull request

Add support for profiler hooks in logger #1036

Closed

upsj force-pushed the tau_logger branch 2 times, most recently from 13f2c38 to f475279 Compare December 7, 2022 16:44

upsj changed the title ~~Add TAU logger~~ Add Profiler annotation logger Dec 7, 2022

upsj requested review from tcojean and a team December 8, 2022 18:48

upsj force-pushed the tau_logger branch from f475279 to 515ac9e Compare December 8, 2022 21:11

upsj added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:need-feedback The PR is somewhat ready but feedback on a blocking topic is required before a proper review. labels Dec 8, 2022

ginkgo-bot force-pushed the tau_logger branch from 515ac9e to 3207363 Compare December 8, 2022 21:17

MarcelKoch self-requested a review December 9, 2022 09:01

MarcelKoch reviewed Dec 9, 2022

View reviewed changes

core/log/profiler_hook.cpp Outdated Show resolved Hide resolved

yhmtsai reviewed Dec 9, 2022

View reviewed changes

upsj mentioned this pull request Dec 12, 2022

Use RAII scope guards for logging begin/end pairs #1238

Closed

upsj and others added 23 commits February 2, 2023 10:36

add TAU logger

abea028

the logger adds timers for all important Ginkgo events. this PR also forwards all logger events to the executor.

fix typo

01a97f3

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>

extract color constant

f5f1bf3

fix compilation issue with disabled CUDA

289336b

make ROCTX optional

25e1a60

format

cebb55a

handle event propagation on the Executor level, add named objects

82ef090

make ProfilerHook public, move state inside

c829fbf

design update

2f28892

* allow synchronization for debugging purposes * use singleton for TAU again * add Executor-dependent create function * add documentation * provide profiling scope guards for internal use * automatically determine whether to propagate * add tests

use find module for NVTX

ea73401

enable propagation for profiler hook

36b0797

support nvtx3 and legacy nvtx simultaneously

17cc0d6

use correct way of including nvtx3

fa652ba

formatting

4b0dbc0

fix NVTX in statically linked Ginkgo

5467287

review updates

aa4cc57

* improve comments * make TAU logger non-singleton Co-authored-by: Marcel Koch <marcel.koch@kit.edu>

simplify exec synchronization, return unique_ptr

97195dc

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>

link against dl in static builds

4888146

fix static ROCm builds

692f623

fix roctx static build

0cc9ab6

add VTune

d0a1b50

fix roctx pkg-config setup

2304535

review updates

159bd83

- improve documentation - add tests - use shared_ptr Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>

upsj force-pushed the tau_logger branch from f3bed43 to 159bd83 Compare February 2, 2023 09:37

upsj merged commit a020111 into develop Feb 2, 2023

upsj deleted the tau_logger branch February 2, 2023 15:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Profiler annotation logger #1055

Add Profiler annotation logger #1055

upsj commented Jun 17, 2022 •

edited

Loading

tcojean left a comment •

edited

Loading

upsj commented Dec 8, 2022

ginkgo-bot commented Dec 8, 2022

MarcelKoch left a comment

upsj commented Dec 9, 2022

upsj commented Dec 9, 2022

upsj commented Dec 9, 2022

upsj commented Dec 9, 2022 •

edited

Loading

upsj commented Dec 12, 2022

ginkgo-bot commented Feb 2, 2023

sonarqubecloud bot commented Feb 2, 2023

Add Profiler annotation logger #1055

Add Profiler annotation logger #1055

Conversation

upsj commented Jun 17, 2022 • edited Loading

tcojean left a comment • edited Loading

Choose a reason for hiding this comment

upsj commented Dec 8, 2022

ginkgo-bot commented Dec 8, 2022

MarcelKoch left a comment

Choose a reason for hiding this comment

upsj commented Dec 9, 2022

upsj commented Dec 9, 2022

upsj commented Dec 9, 2022

upsj commented Dec 9, 2022 • edited Loading

upsj commented Dec 12, 2022

ginkgo-bot commented Feb 2, 2023

sonarqubecloud bot commented Feb 2, 2023

upsj commented Jun 17, 2022 •

edited

Loading

tcojean left a comment •

edited

Loading

upsj commented Dec 9, 2022 •

edited

Loading