control_ca test is not reproducible with 2-threads #743

Closed
DeniseWorthen opened this issue Aug 10, 2021 · 36 comments · Fixed by #832
Labels
bug Something isn't working

Comments

@DeniseWorthen
Collaborator

Description

Changing the control_ca test to 2 threads and running it against the current baseline fails.

This was found while developing the update to the low-resolution coupled tests (where CA is enabled) for the cpld_2threads test. The test version of cpld_2threads failed.

To determine the cause, I applied the same threading changes used in control_2threads to control_ca in the develop branch. The control_ca test with 2 threads failed.

baseline dir = /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20210806/INTEL/control_ca
working dir  = /lustre/f2/scratch/Denise.Worthen/FV3_RT/rt_536/control_ca
Checking test 001 control_ca results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf012.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc .........OK
 Comparing atmf012.nc ............ALT CHECK......NOT OK
 Comparing GFSFLX.GrbF00 .........OK
 Comparing GFSFLX.GrbF12 .........NOT OK
 Comparing GFSPRS.GrbF00 .........OK
 Comparing GFSPRS.GrbF12 .........NOT OK

 0: The total amount of wall time                        = 145.112470

To Reproduce:

Check out the current develop branch. Add the following change to the control_ca test and run the test against the current baseline.

diff --git a/tests/tests/control_ca b/tests/tests/control_ca
index f0f3b1c5..c657a15f 100644
--- a/tests/tests/control_ca
+++ b/tests/tests/control_ca
@@ -36,6 +36,13 @@ export FV3_RUN=control_run.IN
 export CCPP_SUITE=FV3_GFS_v16
 export INPUT_NML=control_ca.nml.IN

+export THRD=2
+export TASKS=$TASKS_thrd
+export TPN=$TPN_thrd
+export INPES=$INPES_thrd
+export JNPES=$JNPES_thrd
+export WRTTASK_PER_GROUP=6
+
 export DO_CA=.T.
 export CA_SGS=.T.
 export CA_GLOBAL=.T.
@DeniseWorthen DeniseWorthen added the bug Something isn't working label Aug 10, 2021
@DeniseWorthen DeniseWorthen changed the title contol_ca test is not reproducible with 2-threads control_ca test is not reproducible with 2-threads Aug 10, 2021
@DeniseWorthen
Collaborator Author

@ligiabernardet would you be the person to look into this? If not, could you suggest someone? Thanks.

@pjpegion
Collaborator

@lisa-bengtsson

@DeniseWorthen
Collaborator Author

Thanks @pjpegion. I think I had the wrong name associated w/ CA.

@lisa-bengtsson
Contributor

@DeniseWorthen is this on all machines?

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Aug 10, 2021

The particular test I ran was on Gaea. Since we run control_ca on all machines and control_2threads on all machines except wcoss-cray, I would expect the same behavior.

@lisa-bengtsson
Contributor

I'm looking into it now

@lisa-bengtsson
Contributor

@DeniseWorthen does it work if CA_GLOBAL is set to False?

@lisa-bengtsson
Contributor

Also, what is iseed_ca in your test?

@DeniseWorthen
Collaborator Author

I didn't make any changes to the control_ca test other than adding the threading. The default test uses ISEED_CA=12345.

I can try to run w/ CA_GLOBAL=false.

@lisa-bengtsson
Contributor

Thank you, that is very helpful.

@lisa-bengtsson
Contributor

Denise, are you comparing your experiment with 2 threads against a baseline generated with 1 thread?

@DeniseWorthen
Collaborator Author

Yes, that is how the 2threads tests work. We compare the threaded run against the control.

@DeniseWorthen
Collaborator Author

My test w/ CA_GLOBAL=F also failed against the control_ca.

@lisa-bengtsson
Contributor

Ok, thanks for checking. I'm discussing with @pjpegion now. There is no code in the CA using OMP threads, and on Hera with Intel it compiles and runs in debug mode. I recall the message from @DomHeinzeller regarding the code failing with the Cheyenne GNU compiler due to a "Floating-point exception" error. Maybe it is related... looking into this now.

@pjpegion
Collaborator

@DeniseWorthen I was looking at the regression tests, and it seems that the coupled 2-threads test also changes the processor layout for the atmosphere, so this test is also checking the decomposition. I confirmed that the CA code reproduces with threads in the atmosphere-only configuration when the processor layout is unchanged.

I suspect the different answer is due to the way the random seed is defined on each task, and is not a threading issue.

@DeniseWorthen
Collaborator Author

Thanks @pjpegion. So how will the coupled model running with CA as the default pass a 2threads test?

@pjpegion
Collaborator

Use the same processor layout for the 1- and 2-threaded tests.

@lisa-bengtsson
Contributor

@DeniseWorthen the random seed is essentially defined as seed = (iseed_ca + timestep + mype); if I don't include the "mype" dependency, each processor gets the same random pattern. I would like to keep it this way if possible. As Phil suggests, can we use the same processor layout for the 1- and 2-threaded tests?
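
For illustration only, a minimal standard-Fortran sketch (not the actual cellular_automata code) of why a mype-dependent seed ties the results to the MPI decomposition: the random stream a given grid point sees depends on which task owns it, so changing INPES/JNPES changes the answers even when the physics is unchanged.

program seed_sketch
  ! Sketch of the seeding scheme described above: seed = iseed_ca + timestep + mype.
  ! Because mype enters the seed, a different processor layout assigns a
  ! different random stream to the same grid point.
  implicit none
  integer :: iseed_ca, kstep, mype, nseed
  integer, allocatable :: seed(:)
  real :: r

  iseed_ca = 12345      ! ISEED_CA from the control_ca test
  kstep    = 1          ! model time step counter
  mype     = 0          ! MPI rank; changes when the decomposition changes

  call random_seed(size=nseed)
  allocate(seed(nseed))
  seed = iseed_ca + kstep + mype
  call random_seed(put=seed)
  call random_number(r)
  print *, 'rank', mype, 'first random number:', r
end program seed_sketch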

@DeniseWorthen
Collaborator Author

I will need to check w/ Jun when she gets back as to why that is not how the 2threads tests are set up.

@pjpegion
Collaborator

Please do, since I wasted a bunch of time today thinking it was an OpenMP issue when it is an MPI decomposition issue.

@DeniseWorthen
Collaborator Author

Thanks to both of you for your quick effort and explanation.

@pjpegion
Collaborator

@lisa-bengtsson and I are also trying to figure out a solution to deal with the mpi decomposition issue. But it won't be quick.

@SMoorthi-emc
Contributor

SMoorthi-emc commented Aug 10, 2021 via email

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Aug 10, 2021

The goal has been for the updated RTs for the coupled model to track the Prototype configurations. But I can add the new cpld_2threads test with CA turned off if necessary. I would also need a control case without CA, though, which is less than ideal.

@lisa-bengtsson
Contributor

@SMoorthi-emc yes, the random numbers in the code are independent of threads. But I do have a dependency on "mype", otherwise each processor will have the same pattern. How is it done in RAS?

@SMoorthi-emc
Contributor

SMoorthi-emc commented Aug 10, 2021 via email

@lisa-bengtsson
Contributor

@DeniseWorthen I understand. Is it possible to do so as a temporary solution while we think of ways to improve this? Perhaps using lat/lon as Moorthi has done could be a solution, but it may need a couple of months for testing.

@lisa-bengtsson
Contributor

@SMoorthi-emc thanks, we will take a look at your code.

@DeniseWorthen
Collaborator Author

@lisa-bengtsson I'll need to talk to Jun about what priority order we want: low number of tests vs. consistency w/ the prototypes vs. testing threading.

@junwang-noaa
Collaborator

@lisa-bengtsson I'd like to follow up on this issue. Will Moorthi's method of using the global location rather than the task number give you a different pattern on each task? Currently all the upcoming PRs that change results will be on hold until the issues with P7c are resolved.

@pjpegion
Collaborator

@junwang-noaa Either Moorthi's method or the method used by SPPT will create random patterns that are not dependent on processor layout, but either will change the results compared to the current CA implementation.

This feature should not prevent P7c testing since the current results are valid for any processor layout, just not bitwise reproducible.
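
For context, a decomposition-independent scheme (the idea behind Moorthi's lat/lon approach and the SPPT-style pattern, sketched here with illustrative names rather than the actual code) seeds each grid column from its global location instead of the MPI rank, so every layout produces the same pattern:

subroutine point_seed(iseed_ca, kstep, iglobal, jglobal, nx_global, seed)
  ! Hedged sketch: derive the seed from the global (i,j) index of the column
  ! rather than from mype. Whichever task owns the point computes the same
  ! seed, so the pattern does not depend on the INPES/JNPES layout or on
  ! threading. Names and interface are illustrative only.
  implicit none
  integer, intent(in)  :: iseed_ca, kstep, iglobal, jglobal, nx_global
  integer, intent(out) :: seed
  seed = iseed_ca + kstep + (jglobal - 1)*nx_global + iglobal
end subroutine point_seed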

@lisa-bengtsson
Contributor

@junwang-noaa Phil is planning a future PR that will change the control_ca baseline, including:
- unit testing / a stand-alone CA
- new CA seed generation to ensure reproducibility across processor layouts
- inclusion of a control_ca_restart test

But as he mentions above, the scientific results generated by P7c without this PR will still be valid. We are replacing one random number with another random number (that is why new baselines will be needed).

@junwang-noaa
Copy link
Collaborator

junwang-noaa commented Aug 19, 2021 via email

@lisa-bengtsson
Contributor

Hi Jun, regarding this MPI reproducibility issue, I believe we need a bit more time to find the best solution for the new seed generation, so if it is possible to do as you suggest above, I would greatly appreciate it. I think we should be able to commit a PR that ensures reproducibility of the CA across different MPI decompositions in a couple of weeks.

@junwang-noaa
Collaborator

Lisa, thanks for the information. So we will turn off the CA in the P7 for the decomposition (control) regression test.

@lisa-bengtsson
Contributor

confirmed

@DeniseWorthen DeniseWorthen linked a pull request Sep 30, 2021 that will close this issue