Commit
Removing preprocessor directives to re-enable print statements on GPU for debug and other conditions.

Original problem:
-----------------

Following feedback that debug information was still desirable for OpenACC device-executed code where possible, this change removes all preprocessor directives that guarded against compiling statements which write to standard output.

These directives were originally used because debug statements and other standard output could greatly reduce performance, since certain variables had to be copied from the host to the device purely for debug output. Additionally, when such statements were located within parallel-execution regions, the output was not guaranteed to appear in any specific order, and the additional IF branches would also have reduced performance, as branching is inefficient on SIMD architectures.

Resolutions:
------------

With a bit of extra work, a few of these issues are alleviated so that output works again as requested.

First, on the data side, the impact of pulling in variables just for debugging was minimized by ensuring the data is pulled in once and stays resident on the GPU for the entire subroutine execution. While this increases the memory footprint on the device, which may have very limited memory, it reduces the data-transfer performance hit.

Next, where debug output was not within parallel regions but still needed to execute on the GPU to show the proper values at that point in the program's execution, OpenACC serial regions were used. These avoid transferring the data off the GPU mid-execution just to show debug output, and they also partially solve the problem of out-of-order output. Since debug regions are guarded by IF blocks, these serial regions do not significantly impact performance when debug output is turned off (debug_code=0).
However, the slowdown is significant at any other debug level, which should be acceptable for debugging situations.

Performance Changes:
--------------------

These changes accomplish the goal of re-enabling debugging output, but not without a cost. Overall GPU runtime was slightly impacted when tested with 150k and 750k vertical columns (the value of ite used in the i-loops) and debugging turned off (debug_code=0). For 150k columns, GPU runtime increased from the original baseline of 22 ms to 30 ms. For 750k columns, GPU runtime increased from the original baseline of 31 ms to 70 ms. The impact is greater for the larger number of columns because the mid-loop IF branches are evaluated more times on the GPU. While these are declines in performance, the GPU runs are still significantly faster than the CPU-only tests (8.7x and 18.7x speedups for 150k and 750k, respectively).

Compilation Time Changes:
-------------------------

One additional observation concerns compilation time. When all debug output is disabled (debug_code=0), compilation takes approximately 90 seconds with the additional serial blocks, IF branches, and so forth, as each of these requires more work from the OpenACC compiler to generate code for the GPU. This problem is compounded when the debug_code option is increased to either 1 (some debug output) or 2 (full debug output). At a value of 1, compilation time jumps to approximately 12.5 minutes on the Hera GPU nodes. At a value of 2, it increases further to approximately 18.5 minutes on the same GPU nodes. The explanation is that the OpenACC compiler must enable greater amounts of serial and branching code, which (again) is less optimal on the GPU, so the compiler must do more work to optimize it as best it can.
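
The pattern described under Resolutions (data kept resident on the device for the whole routine, with debug output in an IF-guarded serial region) might look roughly like the sketch below. It is written in C with OpenACC pragmas purely for illustration and is not taken from the changed source: the routine name column_update and the arrays t and q are hypothetical, while ite and debug_code follow the names used above. It also assumes a compiler that supports printf inside device regions; without OpenACC enabled, the pragmas are ignored and the code runs on the host.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for one physics subroutine operating on ite columns.
 * debug_code: 0 = no debug output, 1 or 2 = debug output enabled. */
void column_update(double *t, const double *q, int ite, int debug_code)
{
    assert(ite > 0);

    /* Keep t and q resident on the device for the entire routine, so that
     * debug output never forces a host/device transfer mid-routine. */
    #pragma acc data copy(t[0:ite]) copyin(q[0:ite])
    {
        /* Main parallel work over the columns. */
        #pragma acc parallel loop
        for (int i = 0; i < ite; ++i) {
            t[i] += 0.5 * q[i];
        }

        /* Debug output outside the parallel loop: the host-side IF guard
         * means the serial region is never launched when debug_code is 0.
         * The serial region runs on the device, so it prints the values as
         * they currently exist there, in a single well-defined order. */
        if (debug_code >= 1) {
            #pragma acc serial present(t[0:ite])
            {
                printf("debug: t[0]=%f t[ite-1]=%f\n", t[0], t[ite - 1]);
            }
        }
    }
}
```

Calling column_update(t, q, n, 0) takes only the parallel path, while any non-zero debug_code additionally launches the serial region, which is where the extra runtime cost at higher debug levels would come from in a layout like this.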