
Parallelizing pcgsolve with openmp #1154

Open
zLi90 opened this issue Oct 26, 2021 · 12 comments

zLi90 (Author) commented Oct 26, 2021

We are using the pcg solver provided in KokkosKernels for our hydrological model. We found that when we enable OpenMP, the scaling does not look good.

We simply call `run_pcg<Kokkos::OpenMP>(domsub.nCellDomain, statesub, domsub, timers);`, and inside `run_pcg` we call:

```cpp
KokkosKernels::Experimental::Example::pcgsolve(
    kh, matA, vecB, vecX,
    cg_iteration_limit, cg_iteration_tolerance, &cg_result,
    true, clusterSize, useSequential);
Kokkos::fence();
```

The pcgsolve takes about 90% of the computational time, and it does not scale well as we add threads. But since it is provided by KokkosKernels, it is difficult for us to find out what the reason might be. Are there additional steps we might be missing that would enhance the scaling performance of the pcg solver?
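For reference, here is a simplified sketch of how we time the solve; the variables (`kh`, `matA`, and so on) are from our code above, and only `Kokkos::Timer` is assumed beyond what the call already uses:

```cpp
// Fragment at the call site; kh, matA, vecB, vecX, etc. are set up elsewhere.
Kokkos::Timer timer;
KokkosKernels::Experimental::Example::pcgsolve(
    kh, matA, vecB, vecX,
    cg_iteration_limit, cg_iteration_tolerance, &cg_result,
    true, clusterSize, useSequential);
Kokkos::fence();  // make sure all kernels have finished before reading the clock
double solve_seconds = timer.seconds();
```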

We also want to ask if you have any recommendations for choosing a linear system solver (and perhaps also a nonlinear solver) that is compatible with Kokkos and has good scaling performance with OpenMP and MPI.

Here I have attached a figure showing the speedup of our model (the blue dots) when using pcgsolve. Thank you so much in advance!

[Attached figure: measured speedup vs. number of threads]

srajama1 (Contributor) commented

@jennloe

lucbv (Contributor) commented Oct 28, 2021

Actually, a good start would be to re-run this with finer timers, for instance using the Kokkos-tools simple-kernel-timer. You just need to build the tool and then set:

`export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so`
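For completeness, a sketch of the whole workflow (paths assume a clone under `$HOME`, and `./your_app` is a placeholder for the actual executable):

```bash
# Build the simple-kernel-timer tool.
git clone https://github.com/kokkos/kokkos-tools.git ${HOME}/kokkos-tools
cd ${HOME}/kokkos-tools/profiling/simple-kernel-timer
make

# Point Kokkos at the profiling hook, then run the application as usual.
export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so
./your_app    # writes a <hostname>-<pid>.dat file in the working directory

# The .dat file is binary; decode it with the bundled reader.
${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_reader *.dat
```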

jennloe (Contributor) commented Oct 29, 2021

I will start looking at this...

jennloe (Contributor) commented Oct 29, 2021

@zLi90 What size problem were you running on to get these timings? How sparse is the matrix? What was the convergence tolerance? How many iterations were you running? And what size of problem do you ultimately want to be able to run?

zLi90 (Author) commented Oct 29, 2021

@jennloe We are running a 3D transient subsurface flow model. This test case has 40x24x32 = 30720 grid cells, resulting in a 7-diagonal matrix. The tolerance is 1e-8, and each time step takes roughly 18 pcg iterations (we simulated a total of 900 s with a time step dt = 60 s, i.e. 15 steps). We hope to eventually run large-scale hydrological simulations with 1 million grid cells or more. Thanks!
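To make the matrix structure concrete, here is an illustrative sketch (not our production code) of how such a 7-point stencil maps onto a `KokkosSparse::CrsMatrix`; the coefficients (6 on the diagonal, -1 off-diagonal) are placeholders for the actual conductivities:

```cpp
#include <Kokkos_Core.hpp>
#include <KokkosSparse_CrsMatrix.hpp>
#include <string>
#include <vector>

// Assemble a 7-point stencil on an nx x ny x nz grid in CRS form.
KokkosSparse::CrsMatrix<double, int, Kokkos::OpenMP> build_stencil(int nx, int ny, int nz) {
  const int n = nx * ny * nz;  // 40*24*32 = 30720 in our test case
  std::vector<int> rowmap(n + 1, 0);
  std::vector<int> cols;
  std::vector<double> vals;
  auto id = [=](int i, int j, int k) { return (k * ny + j) * nx + i; };
  for (int k = 0; k < nz; ++k)
    for (int j = 0; j < ny; ++j)
      for (int i = 0; i < nx; ++i) {
        // Diagonal entry plus up to six face neighbors.
        cols.push_back(id(i, j, k));  vals.push_back(6.0);
        if (i > 0)      { cols.push_back(id(i - 1, j, k)); vals.push_back(-1.0); }
        if (i < nx - 1) { cols.push_back(id(i + 1, j, k)); vals.push_back(-1.0); }
        if (j > 0)      { cols.push_back(id(i, j - 1, k)); vals.push_back(-1.0); }
        if (j < ny - 1) { cols.push_back(id(i, j + 1, k)); vals.push_back(-1.0); }
        if (k > 0)      { cols.push_back(id(i, j, k - 1)); vals.push_back(-1.0); }
        if (k < nz - 1) { cols.push_back(id(i, j, k + 1)); vals.push_back(-1.0); }
        rowmap[id(i, j, k) + 1] = static_cast<int>(cols.size());
      }
  // This constructor copies the host arrays into device views.
  return KokkosSparse::CrsMatrix<double, int, Kokkos::OpenMP>(
      "A", n, n, vals.size(), vals.data(), rowmap.data(), cols.data());
}
```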

zLi90 (Author) commented Dec 8, 2021

Hi @jennloe, just wondering if you have figured anything out? Do you have any suggestions on what we could try next? Thanks!

jennloe (Contributor) commented Dec 10, 2021

I'm sorry; I have not had more time to look at this, and I will not have time in the next few weeks. Thank you for your patience. @brian-kelley would you be able to look into this scaling?

srajama1 (Contributor) commented

I like Luc's comment above. @zLi90 can you enable the finer timers please?

zLi90 (Author) commented Dec 12, 2021

@srajama1 @lucbv Thanks for the suggestion! I have tried the simple-kernel-timer but I am not sure how to decode the output file (or perhaps I did something wrong?)

For example, I got an output file with the first few lines like this:

(mostly binary, with legible fragments such as) Kokkos::Impl::host_space_deepcopy_double ... Kokkos::View::initialization [Diag for Seq SOR] ... Kokkos::View::initialization [RHS] ... Kokkos::View::initialization [Vvoid] ... Kokkos::View::initialization [cg::Ap] ... Kokkos::View::initialization [cg::p] ... Kokkos::View::initialization [cg::r] ...

zLi90 (Author) commented Dec 12, 2021

OK, I found the provided reader. This is useful indeed! Here are the top 5 time-consuming sections (columns appear to be: total time [s], call count, time per call [s], % of Kokkos time, % of total time):

with 1 thread:

  • SubsurfaceModel::computeFaceConductivity(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1}
    (ParFor) 1.248947 30 0.041632 29.454716 23.021248
  • SubsurfaceModel::wcRedistribution(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1}
    (ParFor) 1.213125 15 0.080875 28.609900 22.360956
  • KokkosSparse::spmv<NoTranspose,Static>
    (ParFor) 0.987691 231 0.004276 23.293334 18.205629
  • KokkosBlas::Axpby::S15
    (ParFor) 0.214088 663 0.000323 5.048977 3.946185
  • KokkosBlas::dot<1D>
    (ParRed) 0.205637 678 0.000303 4.849678 3.790417

with 2 threads:

  • SubsurfaceModel::computeFaceConductivity(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1}
    (ParFor) 0.626923 30 0.020897 28.184585 18.427453
  • SubsurfaceModel::wcRedistribution(SubsurfaceState&, DomainSubsurface&)::{lambda(int)#1}
    (ParFor) 0.585216 15 0.039014 26.309592 17.201558
  • KokkosSparse::spmv<NoTranspose,Static>
    (ParFor) 0.498451 231 0.002158 22.408882 14.651222
  • KokkosBlas::Axpby::S15
    (ParFor) 0.142773 663 0.000215 6.418669 4.196611
  • KokkosBlas::dot<1D>
    (ParRed) 0.141340 678 0.000208 6.354218 4.154472

It seems that the last two items (Axpby and dot) don't scale well (from about 0.21 s to 0.14 s, i.e. only about 1.5x on 2 threads)? I believe they belong to the pcgsolve of KokkosKernels. Perhaps each call is simply too small to parallelize efficiently: with 30720 entries, each dot touches only about 0.5 MB of data.
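To check whether `KokkosBlas::dot` itself scales at this vector length, a minimal standalone benchmark (a sketch; the length and call count are taken from the profile above) could look like this:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_Timer.hpp>
#include <KokkosBlas1_dot.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n    = 30720;  // one entry per grid cell, as in our test case
    const int reps = 678;    // matches the dot call count in the profile
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);
    KokkosBlas::dot(x, y);  // warm-up call
    Kokkos::fence();
    Kokkos::Timer timer;
    double sum = 0.0;
    for (int i = 0; i < reps; ++i) sum += KokkosBlas::dot(x, y);
    Kokkos::fence();
    std::printf("%d dots of length %d: %f s (checksum %g)\n",
                reps, n, timer.seconds(), sum);
  }
  Kokkos::finalize();
  return 0;
}
```

Running this with different OMP_NUM_THREADS values would show whether the dot time alone tracks the solver's behavior.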

srajama1 (Contributor) commented

@zLi90 Do you compile with any BLAS enabled? We have fallback implementations of these routines in Kokkos Kernels, but one should use a vendor BLAS when available. What is the platform and compiler combination?

zLi90 (Author) commented Dec 20, 2021

@srajama1 I am using Mac + gcc, and I just used the default settings when building Kokkos Kernels. Do you mean I should add `-DKokkosKernels_ENABLE_TPL_BLAS=ON` when building?
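If so, I assume the configure would look roughly like this (a hypothetical two-step sketch; paths and install prefixes are placeholders, and `Kokkos_ENABLE_OPENMP` is the Kokkos-side flag for the OpenMP backend):

```bash
# 1) Build Kokkos itself with the OpenMP backend.
cmake -S kokkos -B kokkos-build \
      -DKokkos_ENABLE_OPENMP=ON \
      -DCMAKE_INSTALL_PREFIX=${HOME}/kokkos-install
cmake --build kokkos-build --target install

# 2) Build Kokkos Kernels against that install, enabling a vendor BLAS.
cmake -S kokkos-kernels -B kernels-build \
      -DKokkos_ROOT=${HOME}/kokkos-install \
      -DKokkosKernels_ENABLE_TPL_BLAS=ON
cmake --build kernels-build
```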
