MPAS-O PIO warnings during restart write #1138
Comments
@worleyph I have seen this same issue on LANL machines running mpas-o only. When these errors occur, the model appears to finish, but the output files with the "Bad return value from PIO" errors have zero byte sizes. I have been able to circumvent these issues by using parallel-netcdf 1.5.0 instead of 1.6.0 or higher (it appears 1.6.0 is the default on Mira). Unfortunately I don't understand why this error occurs.
@vanroekel Thanks for your input here. I also have seen this output message, but have not pursued it. We use
On Titan we are currently using
Can you try cray-parallel-netcdf/1.6.1?
I'll see if I can get a reproducer with a low res case first. If so, sure. If not, this may have to wait until later in the week.
@jayeshkrishna, as mentioned in the other issue page, switching to cray-parallel-netcdf/1.6.1 changed nothing.
It looks like we need to sort this out ASAP; it is keeping me from spinning up the high-res ocn/ice. I've tried a variety of processor combinations and PIO strides and nothing is helping: I always get the MPAS IO ERROR message and a zero-length restart file for mpas-o. Interestingly, mpas-cice has no trouble writing a restart with the same number of cores as mpas-o. Maybe not apples to apples, but perhaps it is related to the vertical dimension in the ocean? Pat, is there any possibility of getting the system people to install pnetcdf 1.5 just so we can (hopefully) get stuff running while we figure out what's going on?
@maltrud, your requesting this (to help@olcf.ornl.gov) will have as much impact as my requesting it. You can also cc @mrnorman and @anantharajvg to see if they can lobby behind the scenes. Note that my "successful" run with full I/O achieved by disabling the use of MPI_RSEND had no effect on this issue (the error messages are still there), so the two are not related.
Thanks, @worleyph. The mpi_rsend issue did occur to me, so I'm glad you reminded me that it's a different issue.
Note that I have not managed to reproduce this problem with an ne30_oEC resolution A_WCYCL2000 compset. @maltrud, what compset, resolution, and PE layout are you using when you see this?
Copying text by @vanroekel at

In running an MPAS-O case with approximately 3.6 million cells (RRS18to6), our log.err file has numerous instances of an error, "Bad return value from PIO", and the file written has size 0. If we use pnetcdf/1.5.0, this does not happen and the output looks reasonable and valid (verified with ncdump and by visualizing with ParaView).

After digging through the framework and comparing pnetcdf versions, it appears pnetcdf/1.5.0 works because there was a bug in that version that was remedied in later versions. In MPAS, we use NC_64BIT_OFFSET by default for output. For cdf-2 files, no single variable can exceed 4 GB in size. Any variable in my 18to6 run that is dimensioned nEdges by nVertLevels (e.g., normalVelocity) has a size of ~8 GB and thus violates this constraint. In pnetcdf/1.5.0, only the variable dimensions were accounted for in the size check, with no consideration of the size of an element; this allowed us to pass the size check and proceed to file writes. This was remedied in pnetcdf/1.6.0, so we can no longer write using NC_64BIT_OFFSET.

I still do not understand why I get valid output for an array that violates cdf-2 constraints, and am communicating with the pnetcdf developers on this (see the discussion at https://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/29). However, I think the more appropriate solution is to switch the default output to NC_64BIT_DATA (cdf-5), or at least to allow easier use of this option. From what I can tell, the framework offers no easy way to use NC_64BIT_DATA. If I look at this block from mpas_io.F
I can only get to the 64BIT_DATA option if master_pio_iotype is set. Yet I can't seem to find where this happens. I see no calls to
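For concreteness, here is a minimal sketch of the cdf-2 per-variable size check described above. The nEdges and nVertLevels values are rough assumptions for an RRS18to6-class mesh (roughly 3x the ~3.6M cells, and on the order of 100 levels), not exact mesh dimensions:

```c
#include <stdio.h>

/* Illustrative check of the cdf-2 (NC_64BIT_OFFSET) per-variable limit.
 * Dimension sizes are assumptions for an RRS18to6-class mesh. */
int main(void)
{
    long long nEdges      = 11000000LL; /* assumed, ~3x the ~3.6M cells */
    long long nVertLevels = 100LL;      /* assumed */
    long long elem_size   = 8LL;        /* double precision, in bytes */

    long long var_bytes  = nEdges * nVertLevels * elem_size;
    long long cdf2_limit = (1LL << 32) - 4; /* ~4 GiB per variable */

    printf("variable size: %.1f GB (cdf-2 limit %.1f GB)\n",
           var_bytes / 1e9, cdf2_limit / 1e9);
    if (var_bytes > cdf2_limit)
        printf("violates cdf-2; needs cdf-5 (NC_64BIT_DATA)\n");
    return 0;
}
```

With these assumed sizes the variable comes out to ~8.8 GB, consistent with the ~8 GB figure quoted for normalVelocity, roughly double the cdf-2 limit.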
@jayeshkrishna and @vanroekel, I think this has been resolved? Please confirm (and close if so). I can run an experiment to verify, if that would be useful.
@worleyph Agreed, this should be closed.
fixed by #1456
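For reference, a minimal PnetCDF sketch of what the fix amounts to: creating the file with NC_64BIT_DATA (cdf-5) rather than NC_64BIT_OFFSET (cdf-2) lifts the 4 GB per-variable limit. The file name and dimension sizes below are hypothetical and error checking is omitted:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Minimal sketch: create a cdf-5 file so variables larger than 4 GB
 * are legal. File name and dimension sizes are hypothetical. */
int main(int argc, char **argv)
{
    int ncid, dim_edges, dim_levels, varid, dims[2];
    MPI_Init(&argc, &argv);

    /* NC_64BIT_DATA selects cdf-5; with NC_64BIT_OFFSET (cdf-2),
     * pnetcdf >= 1.6.0 rejects the oversized variable below. */
    ncmpi_create(MPI_COMM_WORLD, "restart.nc",
                 NC_CLOBBER | NC_64BIT_DATA, MPI_INFO_NULL, &ncid);

    ncmpi_def_dim(ncid, "nEdges", 11000000, &dim_edges);  /* assumed */
    ncmpi_def_dim(ncid, "nVertLevels", 100, &dim_levels); /* assumed */
    dims[0] = dim_edges;
    dims[1] = dim_levels;
    /* ~8.8 GB of doubles: over the cdf-2 limit, fine under cdf-5 */
    ncmpi_def_var(ncid, "normalVelocity", NC_DOUBLE, 2, dims, &varid);

    ncmpi_enddef(ncid);
    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```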
In high-resolution coupled runs (-compset ACME_WCYCL2000 -res ne120_oRRS15) on Titan with restart writes enabled (REST_OPTION), warning messages are generated at the end of the run of the form:
For a run on 120,000 cores, with 32768 processes for MPAS-O, 544 warnings are output. For a run on 60,000 cores, with 16384 processes for MPAS-O, there are 267 warnings. These two runs had fixed strides, so perhaps the number of warnings is related to the number of PIO processes in the ocean (512 and 256, respectively)?
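A quick consistency check on that guess, using only the counts reported above (simple arithmetic, not new output from the runs):

```c
#include <stdio.h>

/* Warnings per PIO I/O task for the two runs reported above.
 * All numbers are taken from the issue text. */
int main(void)
{
    int iotasks[2]  = { 512, 256 }; /* PIO processes in the ocean */
    int warnings[2] = { 544, 267 }; /* observed warning counts */

    for (int i = 0; i < 2; i++)
        printf("%3d PIO tasks -> %.2f warnings per task\n",
               iotasks[i], (double)warnings[i] / iotasks[i]);
    /* Both ratios land close to 1, consistent with roughly one
     * warning per PIO I/O task. */
    return 0;
}
```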
The warnings do not appear to prevent the run from completing normally.
The ocn.log output immediately surrounding the warnings includes (for a 1-day run, with restart):
Same thing occurs at the end of a month-long run, with restarts enabled. There are no warnings if REST_OPTION is set to never.
I have not looked at low-resolution cases, nor at other systems. It should be simple to scan the ocean log files for any production jobs (using restarts) to see if similar warning messages show up there.