control_ca test is not reproducible with 2-threads #743
Comments
@ligiabernardet would you be the person to look into this? If not, could you suggest someone? Thanks. |
Thanks @pjpegion. I think I had the wrong name associated w/ CA. |
@DeniseWorthen is this on all machines? |
The particular test I ran was on Gaea. Since we run control_ca on all machines and control_2threads on all machines except wcoss-cray, I would expect the same behavior. |
I'm looking into it now |
@DeniseWorthen does it work if CA_GLOBAL is set to False? |
Also, what is iseed_ca in your test? |
I didn't make any changes to the control_ca test other than adding the threading. The default test uses ISEED_CA=12345. I can try to run w/ CA_GLOBAL=false. |
Thank you, that is very helpful. |
Denise, are you comparing your experiment with 2 threads with a baseline generated with 1 thread? |
Yes, that is how the 2threads tests work. We compare the threaded run against the control. |
My test w/ CA_GLOBAL=F also failed against the control_ca. |
Ok, thanks for checking. I'm discussing with @pjpegion now. There is no code in the CA using OMP threads, and on hera intel it compiles and runs in debug mode. I recall the message from @DomHeinzeller regarding the code failing on Cheyenne GNU compiler due to a "Floating-point exception" error. Maybe it is related... looking into this now. |
@DeniseWorthen I was looking at the regression tests, and it seems that the coupled 2-threads test also changes the processor layout for the atmosphere, so this test is also checking decomposition. I confirmed that the CA code works with threads in the atmosphere-only model with the same processor layout. I suspect the different answer is due to the way the random seed is defined on each task, and is not a threading issue. |
Thanks @pjpegion. So how will the coupled model running with CA as the default pass a 2threads test? |
Use the same processor layout for the 1 and 2 threaded test. |
@DeniseWorthen the random seed is essentially defined as seed = (iseed_ca + timestep + mype). If I don't use the "mype" dependency, each processor gets the same random pattern. I would like to keep it this way if possible. As Phil suggests, can we use the same processor layout for the 1 and 2 threaded tests? |
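A minimal sketch of why a seed of this form ties the answer to the processor layout. This is not the CA code: the 1-D field, the function name `field_rank_seeded`, and the use of Python's `random` module are all assumptions made for illustration. Only the seed formula and the default ISEED_CA=12345 come from the thread.

```python
import random


def field_rank_seeded(n_cells, n_tasks, iseed_ca=12345, timestep=0):
    """Build a global 1-D field where each task draws from its own generator.

    Hypothetical stand-in for a decomposed model field: every task owns a
    contiguous block of cells and seeds with iseed_ca + timestep + mype.
    """
    cells_per_task = n_cells // n_tasks
    field = []
    for mype in range(n_tasks):
        rng = random.Random(iseed_ca + timestep + mype)
        field.extend(rng.random() for _ in range(cells_per_task))
    return field


same_layout_a = field_rank_seeded(8, 2)
same_layout_b = field_rank_seeded(8, 2)
other_layout = field_rank_seeded(8, 4)

# Same layout reproduces; a different task count changes which generator
# (and which draw from it) lands on a given cell.
print(same_layout_a == same_layout_b)
print(same_layout_a == other_layout)
```

With the same number of tasks the field is reproduced exactly, which matches the observation that the atmosphere-only test passes with threads when the layout is unchanged. With a different number of tasks, most cells are filled from a different generator, so a bitwise comparison against the baseline fails.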
I will need to check w/ Jun when she gets back as to why that is not how the 2threads tests are setup. |
Please do, since I wasted a bunch of time today thinking this was an OpenMP issue when it is an MPI decomposition issue. |
So thanks both of you for your quick effort and explanation. |
@lisa-bengtsson and I are also trying to figure out a solution to deal with the mpi decomposition issue. But it won't be quick. |
FYI, I use reproducible random numbers for ras independent of number of threads.
|
The goal has been for the updated RTs for the coupled model to track the Prototype configurations. But I can add the new cpld_2threads test with CA turned off if necessary. I would need to have a control case w/o CA also though which is less than ideal. |
@SMoorthi-emc yes, the random numbers in the code are independent of threads. But I do have a dependency on "mype", otherwise each processor will have the same pattern. How is it done in RAS? |
It is independent of number of mpi tasks.
It only depends on global location.
|
@DeniseWorthen I understand. Is it possible to do so as a temporary solution? We are thinking of ways to improve this. Perhaps using lat/lon, as Moorthi has done, could be a solution. We may need a couple of months for testing. |
@SMoorthi-emc thanks, we will take a look at your code. |
@lisa-bengtsson I'll need to talk to Jun about what priority order we want: low number of tests vs. consistency w/ the prototypes vs. testing threading. |
@lisa-bengtsson I'd like to follow up on this issue. Will Moorthi's method of using the global location rather than the task number give you a different pattern on each task? Currently all the upcoming PRs that change results will be on hold until the issues with P7c are resolved. |
@junwang-noaa Either Moorthi's method or the method used by SPPT will create random patterns that are not dependent on processor layout, but either will change the results compared to the current CA implementation. This feature should not prevent P7c testing since the current results are valid for any processor layout, just not bitwise reproducible. |
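A minimal sketch of a seed that depends only on global location, so the pattern is the same for any number of tasks. This is not the RAS or SPPT implementation, neither of which is shown in this thread. The 1-D field, the function name `field_location_seeded`, and the way the seed combines `iseed_ca`, `timestep`, and the global cell index are assumptions made for illustration.

```python
import random


def field_location_seeded(n_cells, n_tasks, iseed_ca=12345, timestep=0):
    """Build a global 1-D field where each cell's draw depends only on its
    global index, not on which task owns it (hypothetical scheme)."""
    cells_per_task = n_cells // n_tasks
    field = []
    for mype in range(n_tasks):
        start = mype * cells_per_task
        for global_index in range(start, start + cells_per_task):
            # Assumed seed layout: unique per (timestep, global_index) pair.
            seed = (iseed_ca + timestep) * n_cells + global_index
            field.append(random.Random(seed).random())
    return field


one_task = field_location_seeded(8, 1)
two_tasks = field_location_seeded(8, 2)
four_tasks = field_location_seeded(8, 4)

# The field is identical for every decomposition.
print(one_task == two_tasks == four_tasks)
```

Because the task number never enters the seed, the decomposition can change without changing the field. The values still differ from those produced by a task-based seed, which is why a change of this kind needs new baselines.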
@junwang-noaa Phil is planning a future PR that will change the control_ca baseline, including:
- unit testing/stand-alone CA
- new CA seed generation to ensure reproducibility on processors
- inclusion of a control_ca_restart test
But as he mentions above, the scientific results generated by P7c without this PR will still be valid. We are replacing one random number with another random number (that is why new baselines will be needed). |
Phil/Lisa, the issue here is that we cannot maintain a working decomposition test using a different number of MPI tasks for all the future commits coming to the P7 test unless we turn off CA. As Denise mentioned before, we can turn off CA in the P7 regression test, which will diverge from the real P7, and we can then test the decomposition reproducibility with future commits and features added to P7.
|
Hi Jun, regarding this MPI reproducibility issue, I believe we need a bit more time to find the best solution to the new seed generation, so if it is possible to do as you suggest above, I would greatly appreciate it. I think we should be able to commit a PR that ensures different MPI decomposition reproducibility using the CA in a couple weeks. |
Lisa, thanks for the information. So we will turn off the CA in the P7 for the decomposition (control) regression test. |
confirmed |
Description
Changing to 2 threads in the control_ca test and running against the current baseline fails.
This was found while developing the update to the low-resolution coupled tests (where CA is enabled) for the cpld_2threads test. The test version of cpld_2threads failed.
To determine the cause, I added the same thread change used in control_2threads to control_ca in the develop branch. The control_ca test with 2 threads failed.
To Reproduce:
Check out the current develop branch. Add the following change to the control_ca test and run the test against the current baseline.