
More on Gpu kernel fusing #1332

Merged: 3 commits merged from WeiqunZhang:fusing into AMReX-Codes:development on Sep 21, 2020

Conversation

@WeiqunZhang WeiqunZhang commented Sep 3, 2020

Summary

  • Add a Gpu::KernelInfo argument to ParallelFor to let the user indicate
    whether a kernel is a candidate for fusing (see the sketch after this
    list).

  • For MFIter, if the local size is less than or equal to 3, the fuse
    region is turned on and small kernels marked fusable are fused.

  • Add launch macros for fusing.

  • Add fusing to a number of functions used by the linear solvers. Note
    that many more AMReX functions still need to be updated for fusing.

  • Optimize the reduction for the bottom solve.

  • Consolidate memcpy calls in the communication functions.

  • Add an option to use device memory in the communication kernels for
    packing and unpacking buffers. It is currently turned off because it
    did not improve performance in testing; in fact, it was worse than
    using pinned memory. This might change in the future, so the option
    is kept.
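
For illustration, a minimal sketch of how the new argument might be used inside an MFIter loop. It assumes Gpu::KernelInfo exposes a setFusible-style setter and that the new ParallelFor overloads take the KernelInfo as their first argument; the function name scale_mf is made up for this example.

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Gpu.H>

// Hypothetical helper: scale every value of a MultiFab on the GPU.
void scale_mf (amrex::MultiFab& mf, amrex::Real a)
{
    using namespace amrex;
    for (MFIter mfi(mf, TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const Box& bx = mfi.tilebox();
        Array4<Real> const& fab = mf.array(mfi);
        // Mark this small kernel as a fusing candidate.  When the fuse
        // region is active (few local boxes), such launches can be
        // batched into one fused kernel instead of many tiny ones.
        ParallelFor(Gpu::KernelInfo().setFusible(true), bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            fab(i,j,k) *= a;
        });
    }
}
```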

Checklist

The proposed changes:

  • [ ] fix a bug or incorrect behavior in AMReX
  • [x] add new capabilities to AMReX
  • [ ] change answers in the test suite by more than roundoff level
  • [ ] are likely to significantly affect the results of downstream AMReX users
  • [ ] are described in the proposed changes to the AMReX documentation, if appropriate

@WeiqunZhang WeiqunZhang marked this pull request as draft September 3, 2020 00:43
@WeiqunZhang
Member Author

This is still a draft. I submitted it to run CI, and because I like the diff view provided by GitHub.

I recently did a scaling test using amrex/Tutorials/LinearSolvers/ABecLaplacian_C on Summit with 256^3 cells per node. The results are shown below.

| Branch      | # of nodes |   Time |
|-------------|-----------:|-------:|
| development |          1 | 0.2557 |
| fusing      |          1 | 0.2204 |
| development |          8 | 0.3232 |
| fusing      |          8 | 0.2820 |
| development |         64 | 0.3725 |
| fusing      |         64 | 0.3284 |
| development |        512 | 0.4703 |
| fusing      |        512 | 0.4063 |
| development |       4096 | 0.6459 |
| fusing      |       4096 | 0.5769 |

@WeiqunZhang WeiqunZhang force-pushed the fusing branch 18 times, most recently from 0a69f34 to 2175c0f on September 7, 2020 23:02
@WeiqunZhang WeiqunZhang marked this pull request as ready for review September 8, 2020 05:08
@WeiqunZhang
Member Author

This PR is ready for review now. I have run regression tests on AMReX and various application codes on both GPU and CPU. It has passed all tests except for a few CPU tests using cell-centered linear solvers. Those tests show roundoff-level differences due to a change in MultiFab::Dot; see https://ccse.lbl.gov/pub/RegressionTesting/IAMR/2020-09-07-001/index.html for an example. In a different branch, fusing_regtest, I reverted the CPU path in that function to the current implementation in development, and the "failed" tests pass with fusing_regtest. The difference between the PR branch fusing and fusing_regtest can be seen here: WeiqunZhang/amrex@fusing...WeiqunZhang:fusing_regtest.

// Reinterpret offsets into the single device buffer as the tag array
// and the warp-count array.
auto d_tags = reinterpret_cast<TagType*>(d_buffer);
auto d_nwarps = reinterpret_cast<int*>(d_buffer+offset_nwarps);

// Threads per block for the kernel launch.
constexpr int nthreads = 256;
@kngott kngott (Contributor) commented on this diff Sep 14, 2020

Was the increase in nthreads tested, or was it just changed to be consistent with other launches?

Should this be based on MAX_NUM_THREADS or another more universal variable, so it can be easily adjusted for each architecture type without being forgotten?

@WeiqunZhang (Member Author) replied:

Yes, it was tested. But I didn't really see much difference.
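
For reference, a sketch of what the suggestion above could look like: take the block size from a centrally defined constant rather than a literal, so per-architecture tuning happens in one place. AMREX_GPU_MAX_THREADS is assumed here to be such a constant; substitute whatever the build actually defines.

```cpp
// Sketch only: prefer a project-wide maximum-threads constant when one is
// defined, falling back to the current hard-coded value otherwise.
#ifdef AMREX_GPU_MAX_THREADS
constexpr int nthreads = AMREX_GPU_MAX_THREADS;
#else
constexpr int nthreads = 256;   // matches the existing choice
#endif
```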

@WeiqunZhang WeiqunZhang merged commit 1cec808 into AMReX-Codes:development Sep 21, 2020
@WeiqunZhang WeiqunZhang deleted the fusing branch September 21, 2020 15:33
kweide pushed a commit to ECP-Astro/amrex that referenced this pull request Sep 28, 2020
kweide added a commit to ECP-Astro/amrex that referenced this pull request Sep 28, 2020
WeiqunZhang added a commit to WeiqunZhang/amrex that referenced this pull request Sep 29, 2020
The most vexing parse in C++ strikes again.

This closes AMReX-Codes#1422.
WeiqunZhang added a commit that referenced this pull request Sep 29, 2020
The most vexing parse in C++ strikes again.

This closes #1422.
dwillcox pushed a commit to dwillcox/amrex that referenced this pull request Oct 3, 2020
dwillcox pushed a commit to dwillcox/amrex that referenced this pull request Oct 3, 2020