More on Gpu kernel fusing #1332
Conversation
This is still a draft. I submitted it to run CI, and I like the diff view GitHub provides. I have done scaling tests using …
This PR is ready for review now. I have done regression tests on AMReX and various application codes, on both GPU and CPU. It has passed all tests except for a few CPU tests using cell-centered linear solvers; those show roundoff-level differences due to a change in …
## Summary

* Add a Gpu::KernelInfo argument to ParallelFor so the user can indicate whether a kernel is a candidate for fusing (see the sketch after this list).
* For MFIter, if the local number of grids is less than or equal to 3, the fuse region is turned on and small kernels marked fusible will be fused.
* Add launch macros for fusing.
* Add fusing to a number of functions used by the linear solvers. Note that many more AMReX functions still need to be updated for fusing.
* Optimize the reduction for the bottom solve.
* Consolidate memcpy calls in the communication functions.
* Add an option to use device memory in communication kernels for packing and unpacking buffers. It is currently turned off because it did not improve performance in testing; in fact, it was worse than using pinned memory. That might change in the future, so the option is kept.

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] are described in the proposed changes to the AMReX documentation, if appropriate
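A minimal sketch of how a caller might mark a kernel as fusible, assuming the Gpu::KernelInfo overload of ParallelFor and its setFusible method added by this PR; exact names and signatures may differ in other AMReX versions:

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Gpu.H>

// Scale a MultiFab, marking each per-box kernel as a fusing candidate.
void scale (amrex::MultiFab& mf, amrex::Real factor)
{
    for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        amrex::Array4<amrex::Real> const& a = mf.array(mfi);
        // When the local grid count is small (<= 3 in this PR), MFIter
        // opens a fuse region and fusible kernels are batched into a
        // single launch instead of many tiny ones.
        amrex::ParallelFor(amrex::Gpu::KernelInfo().setFusible(true), bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            a(i,j,k) *= factor;
        });
    }
}
```

As I read the summary, kernels without the flag launch immediately as before, so the hint is purely opt-in.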
```cpp
auto d_tags   = reinterpret_cast<TagType*>(d_buffer);
auto d_nwarps = reinterpret_cast<int*>(d_buffer + offset_nwarps);

constexpr int nthreads = 256;
```
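For context, a sketch of the single-allocation layout the two casts above imply: one buffer carved into a tag array followed by per-warp counters. The arena calls and the sizes ntags and nwarps are illustrative assumptions, not the PR's actual code.

```cpp
// Hypothetical setup for the casts shown above: pad the tag segment so the
// int segment that follows stays suitably aligned, then allocate once.
std::size_t offset_nwarps = amrex::Arena::align(ntags * sizeof(TagType));
char* d_buffer = (char*) amrex::The_Arena()->alloc(offset_nwarps + nwarps*sizeof(int));
auto d_tags   = reinterpret_cast<TagType*>(d_buffer);
auto d_nwarps = reinterpret_cast<int*>(d_buffer + offset_nwarps);
```

One buffer means one allocation and one copy instead of two, which is the point of the consolidated-memcpy item in the summary.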
Was the increase in nthreads tested, or was it just changed to be consistent with other launches?
Should this be based on MAX_NUM_THREADS or another, more universal variable, so it can be easily adjusted for each architecture type without being forgotten?
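One way to address this, sketched under the assumption that AMREX_GPU_MAX_THREADS (the build-time launch bound AMReX uses elsewhere) is the right knob here:

```cpp
// Tie the block size to the build-time constant rather than a hard-coded
// literal, so per-architecture tuning happens in one place.
constexpr int nthreads = AMREX_GPU_MAX_THREADS;
```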
Yes, it was tested. But I didn't really see much difference.
This reverts commit 4091007.
The most vexing parse in C++ strikes again. This closes AMReX-Codes#1422.
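For readers unfamiliar with the reference, a generic illustration of the most vexing parse (not the actual code from the fix):

```cpp
struct Widget {};
struct Gadget { Gadget (Widget) {} };

Gadget g1(Widget());  // most vexing parse: declares a function g1 taking a
                      // pointer to a function returning Widget, not an object
Gadget g2{Widget{}};  // brace initialization creates the intended object
```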