Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_proc_bounds called from threaded region #1369

Open
amametjanov opened this issue Apr 5, 2017 · 14 comments
Open

get_proc_bounds called from threaded region #1369

amametjanov opened this issue Apr 5, 2017 · 14 comments
Assignees
Labels

Comments

@amametjanov
Copy link
Member

If there is an error in components/clm/src/biogeophys/BalanceCheckMod.F90, there is a call GetGlobalIndex which is expected to be called from non-threaded regions: GetGlobalIndex calls get_proc_bounds, which throws an abort.

The full stack trace is https://gist.github.com/amametjanov/0b67cf3f9ccec5b67b986ca1d2e43bb6 .

Possible fixes are to either remove the check in get_proc_bounds_new or surround the call sites with !$omp single ... !$omp end single.

@bishtgautam
Copy link
Contributor

@amametjanov: This is a good catch. I'm surprised it wasn't caught early. For the long term fix, I will have a discussion with NCAR CSEG. In the meantime, the easiest fix I can think of is to disable the call to GetGlobalIndex().

@bishtgautam
Copy link
Contributor

This bug has been reported to NCAR CSEG as bug-2439.

@amametjanov
Copy link
Member Author

Agreed about disabling the call to GetGlobalIndex. But acme.log also had the I/O recursion detected message:

13296:  WARNING:  snow balance error  
13296: XL Fortran (I/O initialization): I/O recursion detected.
2017-04-05 05:45:16.431 (WARN ) [0xfffb1aaca70] MIR-08000-3B7F1-4096:2236325:ibm.runjob.client.Job: terminated by signal 6
2017-04-05 05:45:16.431 (WARN ) [0xfffb1aaca70] MIR-08000-3B7F1-4096:2236325:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 13296

which @jayeshkrishna recommends to fix by surrounding the write statements with !$omp master ... !$omp end master. Could you push a fix to master? Needed by @PeterCaldwell.

BTW, any idea why an ne120-A_WCYCL run is producing 'WARNING: snow balance error ' and how to avoid it?

@bishtgautam
Copy link
Contributor

A shortcut fix is now open #1370

@amametjanov
Copy link
Member Author

amametjanov commented Jun 27, 2017

Any news on a possible resolution? I'm seeing snow balance error messages in the runs that hang with 4x16 PE layouts for LND.

Only one thread can be doing IO on a given unit at a time. A proposed patch for review is here: https://gist.github.com/amametjanov/48c7f1d6eb9a97391a97cb0155c9604e.

@amametjanov
Copy link
Member Author

One more related item from the same location: acme.log sometimes contains messages like

3625:  BalanceCheck: soil balance error (mm) 
3625:  nstep         =  420  
3625:  errsoi_col    =  -0.284474305140369060E-02
3625:  clm model is stopping

but the run continues because the call to endrun is commented out: https://github.com/ACME-Climate/ACME/blob/master/components/clm/src/biogeophys/BalanceCheckMod.F90#L673 . So this must be a warning instead of an error?

@thorntonpe
Copy link
Contributor

I have some concerns about having this balance check endrun commented out. We should turn it back on, and if we are concerned about fatal errors for trivial balance problems then we should loosen the tolerance.

@bishtgautam
Copy link
Contributor

@amametjanov: Regarding the proposed patch, can the !$omp critical be moved within the if (found) loop?

@thorntonpe: In clm4_5_1_r082 tag, the call endrun() on BalanceCheckMod.F90#L677 was commented out.

@amametjanov
Copy link
Member Author

amametjanov commented Jun 28, 2017

It's not allowed to branch into or out of a critical block: either all threads encounter it or none. It looks like the found condition might not be true for all threads and so had to surround the whole if-block with that.

Another alternative is to call a single write with a format statement like write(iulog,format_id): e.g. https://github.com/ACME-Climate/ACME/blob/master/components/cam/src/physics/cam/qneg3.F90#L121. But that is also error-prone, because, as the number of on-node MPIxOMP PEs increases, the chance of multiple PEs writing to the same iulog unit and causing an error or a hang goes up.

@bishtgautam
Copy link
Contributor

I should have read the restrictions section for OMP Critical more carefully. Your propsed solution looks good to me. Do you want to issue a PR or should I?

@amametjanov
Copy link
Member Author

Since balance check warning messages are relatively rare, let me test a lighter-weight approach of a single write call without any synchronization: patch here. If this + disabling all other diagnostic messages removes the hang/deadlock, will proceed with a PR. Thanks.

@bishtgautam
Copy link
Contributor

Probably adding //NEW_LINE('')// would split the output into mulitple lines and make it easire to read.

@amametjanov
Copy link
Member Author

Yes, but new_line is an intrinsic function added in Fortran 2003 standard. Didn't want to add a new constraint. Would //achar(10)// (F77) be okay? Or does it need to be //achar(10)//achar(13)//?

@bishtgautam
Copy link
Contributor

//achar(10)// should be sufficient. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants