-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FC5AV1C-04P2 is non-BFB with threading turned on #1203
Comments
I ran few tests and identified the following line in cam/src/physics/cam/physpkg.F90 which might be causing this (line 997): |
@amametjanov ran some tests on Mira and found the results to be BFB. This may imply that the non-BFB behavior is only associated with the Intel compiler. I am not sure if @wlin7 has tested it with any other compiler. I have only run tests with the Intel compiler. Following is the email I got from @amametjanov :
|
@singhbalwinder , @amametjanov , for record, my non-BFB runs were also using intel compiler. |
FYI - I'm comparing pgi and intel on Titan at the moment. Two runs with the Intel compiler were not BFB. PGI job will be submitted next. |
I'll add some results here from Intel (v15) compiler on the SNL institutional cluster skybridge. Note that @singhbalwinder isolated the problem to the physics parameterizations, so that would suggest this is not due to the initialization, and thus it should be detected by ERP tests (which compare a restart run to a full run). ERP_Ld3_P8x4.ne4_ne4.FC5AV1C (PASS) SMS_Ld3_P32x4.ne16_ne16.FC5AV1C (compare against baseline - PASS) |
@mt5555 , FYI - physics parameterizations are called during initialization. |
On Titan, PGI was B4B for two identical runs, while Intel was not., using the 675x4 for ATM and 676x4 everything else for
|
Note that setting BFBFLAG to TRUE did not fix the problem. |
Thanks all for running these tests. So far it seems like the issue only exists when running with the Intel compiler on all machines but Skybridge. I am using intel/16.0.1.150, so it might just be some specific versions of the compiler. In my tests, the problem appears to be somewhere in the radiation codes. If I comment out the physics_update call after radiation, I get BFB answers for my five time steps runs. I am currently looking into the radiation codes. |
Hi all, Per Shaocheng's request, I have created a confluence page to track this problem. The problem and the main findings up to now are copied there. Since not everybody can see the github notice, main findings/updates here will be summarized and posted on the confluence until the problem is resolved. Thanks. |
I can test on Cori with Intel 17. Is it just one of the tests? |
you should be able to run an ERP test which will compare a restart run with a full run. create a baseline: make a new run, and compare against the baseline: I believe @wlin7 isn't using the test system, but instead just making two runs, and than diffing the atm.log file. |
Thanks @wlin7 for setting up the confluence page. @ndkeen : To reproduce this bug, do the following: Run SMS.ne16_ne16.FC5AV1C-04P2 using a PE layout which has more than one thread. This run will generate atm.log.* file in the run directory. After this run finishes, go to the case directory of this test and issue
If you see differences in the numbers after |
And in my runs the difference first shows up on nstep 3, so you do not need a long run to see this. I've been using 1 day, though 5 step is probably enough. |
I ran this test on Edison, with acme master Jan 6: ERP_Ld3_P8x4.ne4_ne4.FC5AV1C FAIL So this confirms that the problem does show up in an ERP test and at very low resolution (ne4). This identical test will PASS on skybridge. These are both xeon systems, the main difference being intel compiler versions. |
I tried FC5AV1C-04P (not 04P2) and the differences showed up with nstep 1. I went back and looked again, and the same was true for FC5AV1C-04P2 (not nstep 3). Sorry for the misinformation. |
@mt5555 , the Titan results are from using intel/15.0.2.164 . |
Ack, I got:
This is with a master a few days ago, but still with cime5 |
Skybridge has Intel 15.0.1 and Mark reports it passes. Titan has 15.0.2 and Pat reports a fail. Someone should try Blues (15.0.0). |
I will try ERP_Ld3_P8x4.ne4_ne4.FC5AV1C on blues |
Noel, I saw this "array index out of bounds" during initialization on cori-knl when doing ne120 simulations, after successful runs using the same executable. Rerun was ok, so no further effort to dig into it. Don't know if it could be related to the current issue. You may also set "cosp_lite=.true." in user_nl_cam to reduce memory usage if the model is built with COSP. |
I ran ERP_Ld3_P8x4.ne4_ne4.FC5AV1C on blues with intel 15.0 (the default with --compiler=intel) and it crashed for me. Can someone else try it on blues? |
Note: There's a new test type in cime5 called "REP" which will simply do 2 identical runs and compare results. Try REP_Ld1_P8x4.ne4_ne4.FC5AV1C. |
@jayeshkrishna , for the blues. are you getting error message like HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:912): assert (!closed) failed I have been seeing such failure lately, a lot, with the model build previously working fine. |
The error message above (HYD_pmcd_... ) can occur whenever a process crashes and the process manager is unable to clean up resources (close sockets etc) . |
I re-ran on cori, and got the same problem (array index out of bounds). I too have seen this and I think sometimes it was related to using too much memory on the node. This case was asking for 64 MPI's on 1 node, so I just submitted asking for 64 MPI's across 4 nodes -- this also failed, different error, but not helpful. All I can tell is it happened before the previous one. I have run other tests similar to this on cori without a problem. I will try a debug run. OK, with DEBUG=TRUE run, the simulation is progressing. I have actually seen this behaviour before with a ne120 F compset on Cori -- where it runs DEBUG, but not optimized, however the error message is different. For the DEBUG run, it ran for 13 ATM steps. I submitted again and after 6 steps, all values are identical. At what point might it be not equal? Ah, I did not change env_mach_pes to use more than 1 thread. I am still trying to debug why it stops for me with 1 thread, but will also run with more. |
@worleyph : Are you using the same PE- layout for all your tests? |
@singhbalwinder , yes. See
Two threads should be enough, but @wlin7 started with 4 threads, and I have been sticking with that. |
Did a diff of the two directories. In cam/physics, the only differences are in
Looks like it is supposed to be BFB:
but perhaps something snuck in anyway? @singhbalwinder , which PR brought in these fixes? |
@singhbalwinder , I commented out the differences between PR #1128 and PR #1147 in three files:
and recovered BFB behavior. Looking at these changes, nothing looks wrong to me, so probably a compiler issue? The micro_mg code makes heavy use of pure and elemental functions and subroutines. I tried removing elemental from one routine in the above, and it made no difference. I tried removing more, and it would require removing LOTS more and changing many of the calls as well, so did not pursue this further. |
Backing out the changes to micro_mg_cam.F90 was sufficient (in this one test):
|
amazed you could track that down so quickly!
So this declaration:
logical :: cldfsnow_logic = .false.
makes me wonder if the Intel compiler makes the variable static and global
to the module? In which case each thread will not have its own copy.
You probably already have a workaround, but if not, it might be worth
trying:
logical :: cldfsnow_logic
cldfsnow_logic = .false.
(edit: I see above your original message already gave the workaround).
|
@mt5555 , Yes, this sort of variable initialization in a subroutine or function sets the SAVE attribute. This practice is not allowed in CAM (at least not on purpose, a few may have slipped through code review), could it be added to the ACME coding standards? |
I'll let someone else determine the fix. |
Actually, there is no reason to initialize this variable at all. It is set before it is used. I'll try this real quick. |
No - this was just restoring the original code, so not supporting the rrtmg_temp_fix option. Your fix was relevant, but initializing to .false. is not necessary. The use of the variable cldfsnow_logic is not really necessary either. The following should be fine as well, I think.
and this will all go away when the rrtmg_temp_fix is adopted, I assume. I'd vote for this latter solution, just to eliminate the cldfsnow_logic variable, but I don't really have a vote here. |
Both "fixes" worked (using my one test). |
@gold2718 : regarding the compiler forcing the 'SAVE' attribute: is that required by the standard or up to the compiler? Perhaps the PGI compiler keeping it a stack variable and adding initialization code so it works correctly? |
Thanks a lot @worleyph ! This is great! These were some temporary codes which are meant to be removed once we settle down on a configuration. I am familiar with this 'SAVE' attribute being assigned to all variable initialized at module level. I always assumed that it is in the standard rather than specific to a compiler. I am not sure why other compilers (PGI and XLF) are not complaining about it. Intel compiler on several machines like Eos, Skybridge etc. was also fine. On Eos, Intel compiler started showing non-BFB answers after CIME5 merge (not sure why). It would have been okay (although still a bad idea) if this logical were to initialize at the beginning and stayed the same for the entire simulation. This particular logical is based on I will follow @worleyph suggestion and fix the code. |
The Fortran standard specifies that all variables initialized in a routine (function or subroutine) inherit the SAVE attribute. Module-level variables are a bit more confusing since they were required to have an explicit SAVE attribute until the 2008 standard where the SAVE attribute became the default. I'm still using explicit SAVE statements at module level since I never know which standard a particular compiler is using. As for PGI and XLF, they have routinely ignored this sort of fine point in the standards. I still have #ifdef CPRPGI lines in the code to do things in a slower, broken way to work around places where PGI does not implement Fortran correctly. BTW, another reason to avoid SAVE variables in routines is that in many cases, it hurts performance. |
This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' into next This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219) This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219) This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
Fix bug in ERP namelist comparison. The following pattern was broken... % ./create_test -o -g ERP.f45_g37_rx1.A.melvin_gnu -b jimtest % ./create_test -c -n ERP.f45_g37_rx1.A.melvin_gnu -b jimtest In the first command, the cmpgen_namelists used to occur during the RUN phase and therefore occurs after the ERP build phase which impacts the contents of the namelist files. In the second command, the cmpgen_namelists occurs after the setup phase and therefore will not match the first. With this change, we have essentially re-introduced all the namelist creations that I tried to remove a few months ago in order to improve performance. The SETUP phase is back up to 15 seconds on melvin. Test suite: scripts_regression_tests Test baseline: Test namelist changes: Test status: bit for bit Fixes #1102 User interface changes?: Namelists in baselines are relative to post-SETUP phase Code review: @jedwards4b
This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219) This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219) This PR addresses an issue which makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) which was declared and assigned at module level. This kind of declaration automatically sets a variable with 'SAVE' attribute which in turn makes the variable a shared variable (to be shared by all the threads). This PR removes this variables and retain the same functionality. Fixes #1203 [BFB] - Bit-For-Bit
@wlin7 recently identified that the model is non-BFB when run with threading turned on. He has reproduced this problem on Eidson, Anvil and Cori with ne30 and ne120 grids. I (Balwinder) was also able to reproduce this on Eos with ne16 and ne30 grids. To reproduce this, I ran a smoke test on Eos:
SMS_Ln5.ne16_ne16.FC5AV1C-04P2
Resubmitting this test (./case.submit) produces non-BFB results when compared against the initial run.
The text was updated successfully, but these errors were encountered: