Adding OpenACC statements to accelerate MYNN surface scheme performance through GPU offloading

Overview:
---------
With very minimal changes to the original code of the scheme, the MYNN surface scheme has been enhanced with OpenACC statements that introduce the capability to offload computational execution to OpenACC-supported accelerator devices (e.g. Nvidia GPUs). Since the scheme operates by looping over many independent vertical columns, its overall computational strategy maps well to GPU hardware, where many iterations of each loop can run in parallel using SIMD-style execution. Data movement has been optimized so that transfers between host memory and device memory are kept to a minimum, since data movement is a significant source of latency when offloading to accelerator devices. Depending on problem size and data-movement strategy, GPU performance ranged from a 3.3x slowdown to a 41.9x speedup versus CPU execution (see the Performance section for more information).

MYNN Scheme Code Changes:
-------------------------
A few minor code changes were unavoidable due to limitations on what OpenACC is able to execute on the accelerator within kernel and parallel blocks. A complete list of these changes follows; changes 1-3 are illustrated in the sketch after the list.

1. Preprocessor directives were added to disable several standard output statements, including those used for debug output. These statements pose different challenges on the host and on the accelerator. When run in parallel on the accelerator, their output is not guaranteed to be presented to the user in order. In limited cases, they would also print variables that were otherwise unnecessary for computation and therefore not transferred to the GPU, so keeping them would add transfer overhead solely to support output. Further, with hundreds of threads executing at once, the output could be quite large and unwieldy. Some of these statements could have been run on the host instead to avoid the problems introduced by parallelization on the device, but that would have required device-to-host transfers to ensure the printed values were correct, again at an additional cost to performance. Disabling them for accelerator builds only seemed the best course of action. They are disabled based on the presence of the __OPENACC compile-time variable, ensuring they are removed only when the code is compiled for accelerator usage, so CPU execution is unaffected.

2. The CCPP errmsg variable declaration on line 349 of module_sf_mynn.F90 was changed to a fixed-length character variable of length 200. Since this variable is at times set in the middle of accelerator kernel regions, it must be present on the accelerator. However, when declared with "len=*" it is an assumed-length character, which OpenACC does not support on the accelerator. Rather than disabling this variable completely, fixing its length allows it to be transferred to and from the accelerator and used there. The change is guarded by preprocessor directives based on the presence of the __OPENACC compile-time variable, so it takes effect only when the code is compiled for accelerator usage and does not affect CPU execution.

3. Preprocessor directives were added to "move" the return statement on line 1399 of module_sf_mynn.F90 out of the main i-loop and instead execute it at line 2006 if errflg is set to 1. This change is necessary because OpenACC accelerator code cannot execute branching such as this, so the conditional return statement can only be executed by the host. This change is likewise guarded by preprocessor directives based on the __OPENACC compile-time variable and does not affect CPU execution.

4. The zLhux local variable in the zolri function was commented out over lines 3671 to 3724. The zLhux variable appears to have been used only to capture values of zolri over multiple iterations, but it is never used or passed along after this collection is completed. Since this local array would be sized at runtime based on the value of nmax, it would be unsupported by OpenACC. Because it is unused, the choice was made to simply comment out the variable and all lines related to it, allowing the remaining code of the function to execute on the accelerator.
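To make the overall pattern concrete, below is a minimal, illustrative sketch of the approach described above. It is not the actual module_sf_mynn.F90 code: the subroutine name, variables, and the stand-in computation are hypothetical. It shows (a) a parallel loop over independent columns inside an OpenACC data region, (b) a debug print disabled for accelerator builds (change 1), (c) the fixed-length errmsg declaration (change 2), and (d) the conditional return moved out of the i-loop to the host (change 3), all keyed off the __OPENACC compile-time variable.

   subroutine mynn_offload_sketch(its, ite, qsfc, errmsg, errflg)
      implicit none
      integer, intent(in)    :: its, ite
      real,    intent(inout) :: qsfc(its:ite)
   #ifdef __OPENACC
      ! Fixed length: assumed-length (len=*) characters cannot be used on the device.
      character(len=200), intent(inout) :: errmsg
   #else
      character(len=*), intent(inout) :: errmsg
   #endif
      integer, intent(out)   :: errflg
      integer :: i

      errflg = 0

      ! Keep data resident on the device for the whole region to minimize transfers.
   !$acc data copy(qsfc, errmsg, errflg)

      ! Each independent vertical column maps to its own GPU thread.
   !$acc parallel loop
      do i = its, ite
         if (qsfc(i) < 0.0) then
            ! Benign race: any failing thread may set the flag and message.
            errflg = 1
            errmsg = 'mynn_offload_sketch: negative qsfc'
   #ifndef __OPENACC
            ! Host-only: unordered device prints and in-loop returns are
            ! disabled for accelerator builds.
            print *, 'debug: bad value in column ', i
            return
   #endif
         end if
         qsfc(i) = sqrt(max(qsfc(i), 0.0))   ! stand-in for the real column computation
      end do
   !$acc end data

   #ifdef __OPENACC
      ! The conditional return is executed by the host, after the device loop.
      if (errflg == 1) return
   #endif
   end subroutine mynn_offload_sketch

A hypothetical compile line for an accelerator build with the NVIDIA HPC SDK compiler might look like the following (the project's actual build flags may differ; note that __OPENACC is this code's own macro, supplied explicitly, and is distinct from the standard _OPENACC macro that -acc defines):

   nvfortran -acc -gpu=cc60 -D__OPENACC -c module_sf_mynn.F90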
Performance:
------------
Performance testing was performed on a single Nvidia P100 GPU versus a single 10-core Haswell CPU on Hera. Since the MYNN surface scheme is serial code, the 10-core Haswell baseline was parallelized by simple data partitioning across the 10 cores using OpenMP threads, such that each thread received a near-equal amount of data (a sketch of this baseline follows the tables below).

When data movement was fully optimized for the accelerator -- meaning all CCPP Physics input variables were pre-loaded on the GPU, as they would be once the CCPP infrastructure fully supports accelerator offloading -- GPU speedups over the 10-core Haswell ranged from 11.9x to 41.9x as the number of vertical columns (i) was varied from 150,000 to 750,000.

Performance Timings (optimized data movement)

   Columns (i) |   CPU   |  GPU  | GPU Speedup
   ------------|---------|-------|------------
       150,000 |  263 ms | 22 ms |    11.9x
       450,000 |  766 ms | 28 ms |    27.0x
       750,000 | 1314 ms | 31 ms |    41.9x

However, standalone performance -- meaning all CCPP Physics input variables were loaded onto the GPU only after being declared in the MYNN subroutine calls -- was slower than the 10-core Haswell due to the additional overhead of those data transfers. In this case, the GPU's lag behind the CPU shrinks as the number of columns increases because the GPU's computational throughput improves with more data faster than the CPU's does, despite the larger volume of data that must be transferred to the device.

Performance Timings (standalone; negative speedups denote a slowdown)

   Columns (i) |   CPU   |   GPU   | GPU Speedup
   ------------|---------|---------|------------
       150,000 |  263 ms |  862 ms |    -3.3x
       450,000 |  766 ms | 1767 ms |    -2.3x
       750,000 | 1314 ms | 2776 ms |    -2.1x

With these results, it is clear that this scheme will perform at its best on accelerators once the CCPP infrastructure also supports OpenACC.
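As a point of reference for the CPU baseline methodology, below is a minimal sketch of the kind of OpenMP partitioning described above (the program name and the stand-in computation are hypothetical; the actual benchmark harness is not part of this commit). A static schedule hands each of the 10 threads a near-equal contiguous block of columns:

   program cpu_baseline_sketch
      use omp_lib
      implicit none
      integer, parameter :: ni = 750000      ! total vertical columns (largest test case)
      real, allocatable  :: qsfc(:)
      integer :: i

      allocate(qsfc(ni))
      qsfc = 1.0

      call omp_set_num_threads(10)           ! one thread per Haswell core

      ! schedule(static) gives each thread a near-equal block of columns.
   !$omp parallel do schedule(static)
      do i = 1, ni
         qsfc(i) = sqrt(max(qsfc(i), 0.0))   ! stand-in for the per-column MYNN computation
      end do
   !$omp end parallel do

      print *, 'done: ', qsfc(1)
   end program cpu_baseline_sketch

Compiled with OpenMP enabled (e.g. gfortran -fopenmp or nvfortran -mp), this illustrates the near-equal-data-per-thread partitioning used for the Haswell timings.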
Contact Information:
--------------------
This enhancement was performed by Timothy Sliwinski at NOAA GSL. Questions regarding these changes should be directed to timothy.s.sliwinski@noaa.gov