More on Gpu kernel fusing #1332
Conversation
This is still a draft. I submitted it to run CI, and I like the diff view GitHub provides. I have done scaling tests using …
This PR is ready for review now. I have done regression tests on AMReX and various application codes, on both GPU and CPU. It has passed all tests except for a few CPU tests using cell-centered linear solvers; those show roundoff-level differences due to a change in …
## Summary

* Add a Gpu::KernelInfo argument to ParallelFor so the user can indicate whether a kernel is a candidate for fusing (see the sketch after this list).
* For MFIter, if the local number of grids is less than or equal to 3, the fuse region is turned on and small kernels marked fusible will be fused.
* Add launch macros for fusing.
* Add fusing to a number of functions used by the linear solvers. Note that many more AMReX functions still need to be updated for fusing.
* Optimize the reduction for the bottom solve.
* Consolidate memcpy calls in the communication functions.
* Add an option to use device memory in communication kernels for packing and unpacking buffers. It is currently turned off because it did not improve performance in testing; in fact, it was worse than using pinned memory. That might change in the future, so the option is kept.

## Checklist

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] are described in the proposed changes to the AMReX documentation, if appropriate
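A minimal sketch of how a caller might mark a kernel as fusible, assuming the Gpu::KernelInfo overload of ParallelFor and its setFusible method added by this PR; exact names and signatures may differ in other AMReX versions:

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Gpu.H>

// Scale a MultiFab, marking each per-box kernel as a fusing candidate.
void scale (amrex::MultiFab& mf, amrex::Real factor)
{
    for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        amrex::Array4<amrex::Real> const& a = mf.array(mfi);
        // When the local grid count is small (<= 3 in this PR), MFIter
        // opens a fuse region and fusible kernels are batched into a
        // single launch instead of many tiny ones.
        amrex::ParallelFor(amrex::Gpu::KernelInfo().setFusible(true), bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            a(i,j,k) *= factor;
        });
    }
}
```

As I read the summary, kernels without the flag launch immediately as before, so the hint is purely opt-in.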
```cpp
auto d_tags   = reinterpret_cast<TagType*>(d_buffer);
auto d_nwarps = reinterpret_cast<int*>(d_buffer + offset_nwarps);

constexpr int nthreads = 256;
```
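For context, a sketch of the single-allocation layout the two casts above imply: one buffer carved into a tag array followed by per-warp counters. The arena calls and the sizes ntags and nwarps are illustrative assumptions, not the PR's actual code.

```cpp
// Hypothetical setup for the casts shown above: pad the tag segment so the
// int segment that follows stays suitably aligned, then allocate once.
std::size_t offset_nwarps = amrex::Arena::align(ntags * sizeof(TagType));
char* d_buffer = (char*) amrex::The_Arena()->alloc(offset_nwarps + nwarps*sizeof(int));
auto d_tags   = reinterpret_cast<TagType*>(d_buffer);
auto d_nwarps = reinterpret_cast<int*>(d_buffer + offset_nwarps);
```

One buffer means one allocation and one copy instead of two, which is the point of the consolidated-memcpy item in the summary.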
Was the increase in nthreads tested, or was it just changed to be consistent with other launches?
Should this be based on MAX_NUM_THREADS or another, more universal variable, so it can be easily adjusted for each architecture type without being forgotten?
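One way to address this, sketched under the assumption that AMREX_GPU_MAX_THREADS (the build-time launch bound AMReX uses elsewhere) is the right knob here:

```cpp
// Tie the block size to the build-time constant rather than a hard-coded
// literal, so per-architecture tuning happens in one place.
constexpr int nthreads = AMREX_GPU_MAX_THREADS;
```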
Yes, it was tested. But I didn't really see much difference.
This reverts commit 4091007.
The most vexing parse in C++ strikes again. This closes AMReX-Codes#1422.
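For readers unfamiliar with the reference, a generic illustration of the most vexing parse (not the actual code from the fix):

```cpp
struct Widget {};
struct Gadget { Gadget (Widget) {} };

Gadget g1(Widget());  // most vexing parse: declares a function g1 taking a
                      // pointer to a function returning Widget, not an object
Gadget g2{Widget{}};  // brace initialization creates the intended object
```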