Describe the bug

This issue was found as a result of Issue #974 and is related to UFS Issue 1556.
In SR PDLIB_EXPLICIT_BLOCK, several DO loops are nested inefficiently for Fortran because they do not traverse arrays in column-major order. For example:
```fortran
DO ITH = 1, NTH
   ISP = ITH + (IK-1) * NTH
   DO IP = 1, NP
      ! Inner loop runs over the second index of VA and U, so each
      ! iteration strides through memory rather than moving contiguously.
      VA(ISP,IP) = U(ITH,IP) * CGSIG(IP) / CLATS(IPLG(IP))
   ENDDO
ENDDO
```
Since the IP loop runs over the number of resident points on each domain-decomposition element, the impact is greatest when the node density (points per PE) is relatively high.
To Reproduce
The slowdown shows up in any run with a large number of resident points per PE; the UFS fully coupled test described under Additional context demonstrates it, and a minimal standalone sketch is given below.
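As a rough illustration only (this is not the WW3 code), the standalone program below times the two loop orders. The sizes nth, nk, np, the identity iplg mapping, and the random data are all assumptions chosen just to mimic the access pattern in PDLIB_EXPLICIT_BLOCK, including the IK loop that surrounds the fragment shown above.

```fortran
! Minimal standalone sketch (not WW3 code): contrasts the two loop orders.
! All sizes and arrays are hypothetical stand-ins for the variables used
! in PDLIB_EXPLICIT_BLOCK.
program loop_order_sketch
  implicit none
  integer, parameter :: nth = 36, nk = 25, np = 20000   ! assumed sizes
  real,    allocatable :: va(:,:), u(:,:), cgsig(:), clats(:)
  integer, allocatable :: iplg(:)
  integer :: ith, ik, ip, isp
  real :: t0, t1

  allocate (va(nth*nk,np), u(nth,np), cgsig(np), clats(np), iplg(np))
  call random_number(u)
  call random_number(cgsig)
  clats = 1.0
  iplg  = [(ip, ip = 1, np)]   ! identity mapping, for the sketch only

  call cpu_time(t0)
  do ik = 1, nk                      ! original order: ip innermost
     do ith = 1, nth
        isp = ith + (ik-1)*nth
        do ip = 1, np                ! strides over va's second index
           va(isp,ip) = u(ith,ip) * cgsig(ip) / clats(iplg(ip))
        end do
     end do
  end do
  call cpu_time(t1)
  print *, 'ip innermost (original):  ', t1 - t0, ' s'

  call cpu_time(t0)
  do ip = 1, np                      ! reordered: first index innermost
     do ik = 1, nk
        do ith = 1, nth
           isp = ith + (ik-1)*nth
           va(isp,ip) = u(ith,ip) * cgsig(ip) / clats(iplg(ip))
        end do
     end do
  end do
  call cpu_time(t1)
  print *, 'isp innermost (reordered):', t1 - t0, ' s'
end program loop_order_sketch
```

At low optimization levels the reordered version should run noticeably faster; depending on the compiler and flags, aggressive optimization may interchange the loops itself and shrink the difference.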
Expected behavior
The loops should be reordered so that the innermost loop runs over the leftmost (fastest-varying in memory) array index:
```fortran
do ip = 1, np
   do ith = 1, nth
      isp = ith + (ik-1)*nth
      ! inner loop now runs over the first index of va, so
      ! successive iterations touch contiguous memory
      va(isp,ip) = u(ith,ip) * cgsig(ip) / clats(iplg(ip))
   end do
end do
```
Additional context
A test was carried out using the fully coupled test case for UFS: a 500K-node mesh on only 36 PEs, integrated for 24 hours with both the existing code and the reordered loops. The answers were bit-for-bit (B4B) identical. The wall clock times (in seconds) are as follows:
current code: The total amount of wall time = 2404.501605
reordered loops: The total amount of wall time = 1955.846262
The ESMF Profile gives the following values for the WAV RunPhase1 (times in seconds):

|                 | min time  | mean time | max time  |
|-----------------|-----------|-----------|-----------|
| current code    | 1746.3447 | 1056.5148 | 2326.2212 |
| reordered loops | 1323.0430 | 640.5865  | 1878.7823 |
The wall clock time is reduced by roughly 19% (1 − 1955.85/2404.50 ≈ 0.187), and the mean WAV RunPhase1 time by roughly 39% (1 − 640.59/1056.51 ≈ 0.394).
I would consider this a high node density test case. Cases with lower node density would probably not show as great an impact.