"recursive I/O operation" error on edison with Intel compiler and threads #1216

Closed
ndkeen opened this issue Jan 12, 2017 · 7 comments

@ndkeen
Contributor

ndkeen commented Jan 12, 2017

Running acme_developer, I stumbled upon this issue. Running without threads seems to avoid it. Looking closer, @jayeshkrishna and @bishtgautam noted that there are writes within an OMP loop.

Gautam writes:

`call clm_ptrs_check(bounds_clump)` is happening within an OMP loop (https://github.com/ACME-Climate/ACME/blob/master/components/clm/src/main/initGridCellsMod.F90#L216).
Try commenting out the `if (masterproc) write(iulog,*) ` in `subroutine clm_ptrs_check ()` to see if the code proceeds further. (https://github.com/ACME-Climate/ACME/blob/master/components/clm/src/main/initSubgridMod.F90#L147)

A better solution would be: 
1. Pass `nc` to `clm_ptrs_check()` (i.e. `call clm_ptrs_check(nc, bounds_clump)`).
2. Modify the write statements in `subroutine clm_ptrs_check ()`: `if (masterproc) write(iulog,*)` -> `if (masterproc .and. nc == 1) write(iulog,*)`
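
The two steps above might look roughly like the following. This is only a minimal sketch, not the actual CLM source: the module name, the `begg`/`endg` arguments (standing in for `bounds_clump`), the placeholder check, and the hard-coded `masterproc` and `iulog` values are all assumptions made so the example compiles on its own.

```fortran
! Sketch only: everything except the names clm_ptrs_check, masterproc,
! iulog and nc is a hypothetical stand-in for the real CLM code.
module ptrs_check_sketch
  implicit none
  logical, parameter :: masterproc = .true.  ! assumed: this rank is the master MPI task
  integer, parameter :: iulog = 6            ! assumed: log unit mapped to stdout
contains
  subroutine clm_ptrs_check(nc, begg, endg)
    integer, intent(in) :: nc          ! clump index handled by the calling thread
    integer, intent(in) :: begg, endg  ! hypothetical stand-in for bounds_clump
    logical :: error

    error = (begg > endg)              ! placeholder for the real pointer checks

    ! Only the thread working on clump 1 of the master task writes,
    ! so at most one thread touches unit iulog at this point.
    if (masterproc .and. nc == 1) write(iulog,*) 'clm_ptrs_check: error = ', error
  end subroutine clm_ptrs_check
end module ptrs_check_sketch

program demo
  use ptrs_check_sketch
  implicit none
  integer, parameter :: nclumps = 4    ! hypothetical number of clumps
  integer :: nc

  !$OMP PARALLEL DO PRIVATE(nc)
  do nc = 1, nclumps
    call clm_ptrs_check(nc, (nc - 1)*10 + 1, nc*10)
  end do
  !$OMP END PARALLEL DO
end program demo
```

Build with OpenMP enabled (for example `ifort -qopenmp` or `gfortran -fopenmp`). One trade-off of this approach is that diagnostics from clumps other than the first are silently dropped, which is fine for an "all checks passed" message but would hide a clump-specific failure.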

When I comment out these writes, most of the issues are resolved.

001:  /project/projectdirs/acme/inputdata/lnd/clm2/surfdata_map/surfdata_0.9x1.25_sim
001:  yr1850_c150626.nc           0
000: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
000: Image              PC                Routine            Line        Source             
000: acme.exe           00000000021A071B  Unknown               Unknown  Unknown
000: acme.exe           0000000000567BD8  initsubgridmod_mp         178  initSubgridMod.F90
000: acme.exe           000000000055B58C  initgridcellsmod_         216  initGridCellsMod.F90
000: acme.exe           00000000022BC1C3  Unknown               Unknown  Unknown
000: acme.exe           00000000022924A0  Unknown               Unknown  Unknown
000: acme.exe           0000000002291715  Unknown               Unknown  Unknown
000: acme.exe           00000000022BC4B1  Unknown               Unknown  Unknown
000: acme.exe           0000000001B63B16  Unknown               Unknown  Unknown
000: acme.exe           000000000238E369  Unknown               Unknown  Unknown
srun: error: nid03849: task 0: Exited with exit code 40

However, at least one similar issue remains. It may be another example of multiple threads trying to write at the same time, in a different location.

00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
00: Image              PC                Routine            Line        Source             
00: acme.exe           0000000001FA591E  Unknown               Unknown  Unknown
00: acme.exe           0000000000517CE2  clm_driver_mp_clm        1077  clm_driver.F90
00: acme.exe           00000000020B56F3  Unknown               Unknown  Unknown
m22inteld/ERS.f09_g16.I1850CLM45CN.edison_intel.20170111_230406/run/acme.log.170112-002939:000: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.f09_g16.IMCLM45BC.edison_intel.20170111_230406/run/acme.log.170112-004738:000: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.f19_f19.I1850CLM45CN.edison_intel.20170111_230406/run/acme.log.170112-010159:00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.f19_f19.IM1850CLM45CN.edison_intel.20170111_230406/run/acme.log.170112-002931:00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.f19_f19.IM1850CLM45CN.edison_intel.20170111_230406/run/acme.log.170112-002931:00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.f19_f19.IMCLM45.edison_intel.20170111_230406/run/acme.log.170112-005714:00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.f19_f19.IMCLM45.edison_intel.20170111_230406/run/acme.log.170112-005714:00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/ERS.ne11_oQU240.I20TRCLM45.edison_intel.20170111_230406/run/acme.log.170112-004414:00: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/SMS.f09_g16_a.IGCLM45_MLI.edison_intel.20170111_230406/run/acme.log.170112-003222:000: forrtl: severe (40): recursive I/O operation, unit 97, file unknown
m22inteld/SMS_D_Ln1.ne30_ne30.FC5AV1C-04.edison_intel.20170111_230406/run/acme.log.170112-004106:000: forrtl: severe (40): recursive I/O operation, unit 96, file unknown
m22inteld/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intel.20170111_230406/run/acme.log.170112-010345:000: forrtl: severe (40): recursive I/O operation, unit 96, file unknown
@bishtgautam
Contributor

bishtgautam commented Jan 13, 2017

@ndkeen : Which branch should one use to reproduce this error?

@jayeshkrishna
Contributor

You might also want to try out whether putting

!$OMP MASTER
...
!$OMP END MASTER

around the print/write statements works too.
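
One caveat, as far as I understand the OpenMP nesting rules: a MASTER region may not be closely nested inside a worksharing loop, and the writes discussed here are reached from inside an OMP loop. A named CRITICAL section is a possible alternative that serializes the writes instead of restricting them to one thread. The sketch below is a standalone illustration, not CLM code; the program name, the `log_write` section name, `nclumps`, and the `iulog = 6` value are assumptions.

```fortran
! Sketch only: a standalone illustration of serializing writes to one unit.
program critical_write_sketch
  implicit none
  integer, parameter :: iulog = 6     ! assumed: log unit mapped to stdout
  integer, parameter :: nclumps = 4   ! hypothetical number of clumps
  integer :: nc

  !$OMP PARALLEL DO PRIVATE(nc)
  do nc = 1, nclumps
    ! Every thread still writes, but only one thread at a time may be
    ! inside the named critical section, so the I/O on iulog never overlaps.
    !$OMP CRITICAL (log_write)
    write(iulog,*) 'checked clump ', nc
    !$OMP END CRITICAL (log_write)
  end do
  !$OMP END PARALLEL DO
end program critical_write_sketch
```

The lines may appear in any order, since the critical section only prevents overlap and does not impose an ordering. Compared with guarding on `nc == 1`, this keeps the messages from every clump at the cost of some serialization.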

@ndkeen
Contributor Author

ndkeen commented Jan 13, 2017

In this case, I'm using master. I was attempting to modify PE layouts, but they don't impact this test.

@bishtgautam
Contributor

@ndkeen : Without changes from PR-#1201, how is it that master is working on Edison?

@worleyph
Contributor

@bishtgautam, what is your definition of "working"? PR-#1201 is to get performance data archiving working. Master runs now, though it is doing annoying things like saving performance data in subdirectories that only the owner can read, and it is not doing checkpointing.

@rljacob rljacob added the Land label Feb 2, 2017
@jgfouca jgfouca reopened this May 18, 2017
@ndkeen
Contributor Author

ndkeen commented May 19, 2017

I was going through old issues. Reading this one, I was reminded that I hit the same error not that long ago. Doing some greps, I see:
a) This has never happened on cori-haswell or cori-knl (at least when looking for "unit 97" in acme.log*).
b) Since reporting this on Jan 12th, I've run at least 25 acme_developer suites on edison (saving all of the files), and "this same thing" has happened in 3 of those complete suite runs. The most recent failure was May 8th. By "the same thing", I mean that ~9 tests (listed below) ALL fail in the same way. All other acme_dev tests do NOT have this error. So either they all fail in the same way, or none fail.
c) The one thing that is different about the acme_dev runs that fail in this way is that I was testing module changes that included using cray-mpich/7.4.1. Using older or newer versions of cray-mpich on edison does not seem to be a problem. 7.4.1 is the current default, but I've also been testing with 7.5.1. Version 7.5.1 is what I'm bringing in with PR #1533, and it would be used when compiling with GNU or Intel v17.

ERS.f09_g16.I1850CLM45CN.edison_intel
ERS.f09_g16.IMCLM45BC.edison_intel
ERS.f19_f19.I1850CLM45CN.edison_intel
ERS.f19_f19.IM1850CLM45CN.edison_intel
ERS.f19_f19.IMCLM45.edison_intel
ERS.f19_g16.I1850CLM45.edison_intel.clm-betr
ERS.f19_g16.I1850CNECACNTBC.edison_intel.clm-eca
ERS.f19_g16.I1850CNECACTCBC.edison_intel.clm-eca
SMS.f09_g16_a.IGCLM45_MLI.edison_intel

@ndkeen
Contributor Author

ndkeen commented Jul 27, 2017

I'm going to close this, as I no longer see the tests failing on edison. It sounds like it was related to a specific MPI version.

@ndkeen ndkeen closed this as completed Jul 27, 2017

6 participants