control_ca test is not reproducible with 2-threads #743

Closed
DeniseWorthen opened this issue Aug 10, 2021 · 36 comments · Fixed by #832
Labels
bug Something isn't working

Comments

@DeniseWorthen
Collaborator

Description

Changing the control_ca test to 2 threads and running it against the current baseline fails.

This was found while developing the update to the low-resolution coupled tests (where CA is enabled) for the cpld_2threads test. The test version of cpld_2threads failed.

To determine the cause, I applied the same threading changes used in control_2threads to control_ca in the develop branch. The control_ca test with 2 threads failed.

baseline dir = /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20210806/INTEL/control_ca
working dir  = /lustre/f2/scratch/Denise.Worthen/FV3_RT/rt_536/control_ca
Checking test 001 control_ca results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf012.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc .........OK
 Comparing atmf012.nc ............ALT CHECK......NOT OK
 Comparing GFSFLX.GrbF00 .........OK
 Comparing GFSFLX.GrbF12 .........NOT OK
 Comparing GFSPRS.GrbF00 .........OK
 Comparing GFSPRS.GrbF12 .........NOT OK

 0: The total amount of wall time                        = 145.112470

To Reproduce:

Check out the current develop branch. Add the following change to the control_ca test and run the test against the current baseline.

diff --git a/tests/tests/control_ca b/tests/tests/control_ca
index f0f3b1c5..c657a15f 100644
--- a/tests/tests/control_ca
+++ b/tests/tests/control_ca
@@ -36,6 +36,13 @@ export FV3_RUN=control_run.IN
 export CCPP_SUITE=FV3_GFS_v16
 export INPUT_NML=control_ca.nml.IN

+export THRD=2
+export TASKS=$TASKS_thrd
+export TPN=$TPN_thrd
+export INPES=$INPES_thrd
+export JNPES=$JNPES_thrd
+export WRTTASK_PER_GROUP=6
+
 export DO_CA=.T.
 export CA_SGS=.T.
 export CA_GLOBAL=.T.
@DeniseWorthen DeniseWorthen added the bug Something isn't working label Aug 10, 2021
@DeniseWorthen DeniseWorthen changed the title contol_ca test is not reproducible with 2-threads control_ca test is not reproducible with 2-threads Aug 10, 2021
@DeniseWorthen
Collaborator Author

@ligiabernardet would you be the person to look into this? If not, could you suggest someone? Thanks.

@pjpegion
Collaborator

@lisa-bengtsson

@DeniseWorthen
Collaborator Author

Thanks @pjpegion. I think I had the wrong name associated w/ CA.

@lisa-bengtsson
Contributor

@DeniseWorthen is this on all machines?

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Aug 10, 2021

The particular test I ran was on Gaea. Since we run control_ca on all machines and control_2threads on all machines except wcoss-cray, I would expect the same behavior.

@lisa-bengtsson
Contributor

I'm looking into it now

@lisa-bengtsson
Contributor

@DeniseWorthen does it work if CA_GLOBAL is set to False?

@lisa-bengtsson
Contributor

Also, what is iseed_ca in your test?

@DeniseWorthen
Collaborator Author

I didn't make any changes to the control_ca test other than adding the threading. The default test uses ISEED_CA=12345.

I can try to run w/ CA_GLOBAL=false.

@lisa-bengtsson
Contributor

Thank you, that is very helpful.

@lisa-bengtsson
Contributor

Denise, are you comparing your experiment with 2 threads against a baseline generated with 1 thread?

@DeniseWorthen
Collaborator Author

Yes, that is how the 2threads tests work. We compare the threaded run against the control.

@DeniseWorthen
Collaborator Author

My test w/ CA_GLOBAL=F also failed against the control_ca.

@lisa-bengtsson
Contributor

Ok, thanks for checking. I'm discussing with @pjpegion now. There is no code in the CA using OMP threads, and on Hera with Intel it compiles and runs in debug mode. I recall the message from @DomHeinzeller regarding the code failing with the Cheyenne GNU compiler due to a "Floating-point exception" error. Maybe it is related... looking into this now.

@pjpegion
Collaborator

@DeniseWorthen I was looking at the regression tests, and it seems that the coupled 2-threads test also changes the processor layout for the atmosphere, so this test is also checking the decomposition. I confirmed that the CA code reproduces with threads in the atmosphere-only configuration when the processor layout is unchanged.

I suspect the different answer is due to the way the random seed is defined on each task, and is not a threading issue.

@DeniseWorthen
Collaborator Author

Thanks @pjpegion. So how will the coupled model running with CA as the default pass a 2threads test?

@pjpegion
Collaborator

Use the same processor layout for the 1- and 2-threaded tests.

@lisa-bengtsson
Contributor

@DeniseWorthen the random seed is essentially defined as seed = (iseed_ca + timestep + mype); if I don't include the "mype" dependency, each processor gets the same random pattern. I would like to keep it this way if possible. As Phil suggests, can we use the same processor layout for the 1- and 2-threaded tests?
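
For illustration only, a minimal standard-Fortran sketch (not the actual cellular_automata code) of why a mype-dependent seed ties the results to the MPI decomposition: the random stream a given grid point sees depends on which task owns it, so changing INPES/JNPES changes the answers even when the physics is unchanged.

program seed_sketch
  ! Sketch of the seeding scheme described above: seed = iseed_ca + timestep + mype.
  ! Because mype enters the seed, a different processor layout assigns a
  ! different random stream to the same grid point.
  implicit none
  integer :: iseed_ca, kstep, mype, nseed
  integer, allocatable :: seed(:)
  real :: r

  iseed_ca = 12345      ! ISEED_CA from the control_ca test
  kstep    = 1          ! model time step counter
  mype     = 0          ! MPI rank; changes when the decomposition changes

  call random_seed(size=nseed)
  allocate(seed(nseed))
  seed = iseed_ca + kstep + mype
  call random_seed(put=seed)
  call random_number(r)
  print *, 'rank', mype, 'first random number:', r
end program seed_sketch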

@DeniseWorthen
Collaborator Author

I will need to check w/ Jun when she gets back as to why that is not how the 2threads tests are set up.

@pjpegion
Collaborator

Please do, since I wasted a bunch of time today thinking it was an OpenMP issue when it is an MPI decomposition issue.

@DeniseWorthen
Collaborator Author

Thanks to both of you for your quick effort and explanation.

@pjpegion
Collaborator

@lisa-bengtsson and I are also trying to figure out a solution to deal with the mpi decomposition issue. But it won't be quick.

@SMoorthi-emc
Contributor

SMoorthi-emc commented Aug 10, 2021 via email

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Aug 10, 2021

The goal has been for the updated RTs for the coupled model to track the Prototype configurations. But I can add the new cpld_2threads test with CA turned off if necessary. I would also need a control case without CA, though, which is less than ideal.

@lisa-bengtsson
Contributor

@SMoorthi-emc yes, the random numbers in the code are independent of threads. But I do have a dependency on "mype", otherwise each processor will have the same pattern. How is it done in RAS?

@SMoorthi-emc
Contributor

SMoorthi-emc commented Aug 10, 2021 via email

@lisa-bengtsson
Contributor

@DeniseWorthen I understand. Is it possible to do so as a temporary solution while we think of ways to improve this? Perhaps using lat/lon as Moorthi has done could be a solution, but it may need a couple of months for testing.

@lisa-bengtsson
Contributor

@SMoorthi-emc thanks, we will take a look at your code.

@DeniseWorthen
Collaborator Author

@lisa-bengtsson I'll need to talk to Jun about what priority order we want: low number of tests vs. consistency w/ the prototypes vs. testing threading.

@junwang-noaa
Collaborator

@lisa-bengtsson I'd like to follow up on this issue. Will Moorthi's method of using the global location rather than the task number give you a different pattern on each task? Currently all the upcoming PRs that change results will be on hold until the issues with P7c are resolved.

@pjpegion
Collaborator

@junwang-noaa Either Moorthi's method or the method used by SPPT will create random patterns that are not dependent on processor layout, but either will change the results compared to the current CA implementation.

This feature should not prevent P7c testing since the current results are valid for any processor layout, just not bitwise reproducible.
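
For context, a decomposition-independent scheme (the idea behind Moorthi's lat/lon approach and the SPPT-style pattern, sketched here with illustrative names rather than the actual code) seeds each grid column from its global location instead of the MPI rank, so every layout produces the same pattern:

subroutine point_seed(iseed_ca, kstep, iglobal, jglobal, nx_global, seed)
  ! Hedged sketch: derive the seed from the global (i,j) index of the column
  ! rather than from mype. Whichever task owns the point computes the same
  ! seed, so the pattern does not depend on the INPES/JNPES layout or on
  ! threading. Names and interface are illustrative only.
  implicit none
  integer, intent(in)  :: iseed_ca, kstep, iglobal, jglobal, nx_global
  integer, intent(out) :: seed
  seed = iseed_ca + kstep + (jglobal - 1)*nx_global + iglobal
end subroutine point_seed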

@lisa-bengtsson
Contributor

@junwang-noaa Phil is planning a future PR that will change the control_ca baseline, including:
- unit testing / a stand-alone CA
- new CA seed generation to ensure reproducibility across processor layouts
- inclusion of a control_ca_restart test

But as he mentions above, the scientific results generated by P7c without this PR will still be valid. We are replacing one random number with another random number (that is why new baselines will be needed).

@junwang-noaa
Copy link
Collaborator

junwang-noaa commented Aug 19, 2021 via email

@lisa-bengtsson
Contributor

Hi Jun, regarding this MPI reproducibility issue, I believe we need a bit more time to find the best solution for the new seed generation, so if it is possible to do as you suggest above, I would greatly appreciate it. I think we should be able to commit a PR that ensures reproducibility of the CA across different MPI decompositions in a couple of weeks.

@junwang-noaa
Collaborator

Lisa, thanks for the information. So we will turn off the CA in the P7 for the decomposition (control) regression test.

@lisa-bengtsson
Contributor

confirmed

@DeniseWorthen DeniseWorthen linked a pull request Sep 30, 2021 that will close this issue