
MPAS-O PIO warnings during restart write #1138

Closed
worleyph opened this issue Nov 13, 2016 · 14 comments


@worleyph
Contributor

In high resolution coupled runs (-compset ACME_WCYCL2000 -res ne120_oRRS15) on Titan with restart writes enabled (REST_OPTION), warning messages are being generated at the end of the run of the form:

  MPAS IO Error: Bad return value from PIO

For a run on 120,000 cores, with 32768 processes for MPAS-O, 544 warnings are output. For a run on 60,000 cores, with 16384 processes for MPAS-O, there are 267 warnings. These two runs had fixed strides, so perhaps the number of warnings is related to the number of PIO processes in the ocean (512 and 256, respectively)?

The warnings do not appear to prevent the run from completing normally.

The ocn.log output immediately surrounding the warnings includes (for a 1-day run, with restart):

 ...
  Completed timestep 0001-01-02_00:00:00
  Writing restart streams
     Writing stream restart
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
     Finished writing stream restart
     Writing stream timeFiltersRestart
     Finished writing stream timeFiltersRestart
     Writing stream eliassenPalmRestart
     Finished writing stream eliassenPalmRestart
     Writing stream timeSeriesStatsRestart
  MPAS IO Error: Bad return value from PIO
 ...
  MPAS IO Error: Bad return value from PIO
     Finished writing stream timeSeriesStatsRestart
  Finished writing restart streams
  Exporting ocean state
  Finished exporting ocean state
        Finalizing AM globalStats
        Finalizing AM layerVolumeWeightedAverage
        Finalizing AM meridionalHeatTransport
        Finalizing AM surfaceAreaWeightedAverages
        Finalizing AM mixedLayerDepths
        Finalizing AM timeSeriesStats

The same thing occurs at the end of a month-long run with restarts enabled. There are no warnings if REST_OPTION is set to never.

I have not looked at low resolution cases, nor at other systems. It should be simple to scan the ocean log files for any production jobs (using restarts) to see if similar warning messages show up there.

@vanroekel
Contributor

@worleyph I have seen this same issue on LANL machines running MPAS-O only. When these errors occur, the model appears to finish, but the output files associated with the "Bad return value from PIO" errors are zero bytes in size. I have been able to circumvent these issues by using parallel-netcdf 1.5.0 instead of 1.6.0 or higher (1.6.0 appears to be the default on Mira). Unfortunately, I don't understand why this error occurs.

@mark-petersen
Contributor

@vanroekel Thanks for your input here. I also have seen this output message, but have not pursued it. We use parallel-netcdf/1.5.0 in our standard module configuration for MPAS-O. Obviously, we need to be able to run with higher versions in the long term. We just have not looked at it more carefully.

@worleyph
Contributor Author

On Titan we are currently using

 module add cray-parallel-netcdf/1.7.0

@jayeshkrishna
Contributor

Can you try cray-parallel-netcdf/1.6.1?

@worleyph
Contributor Author

I'll see if I can get a reproducer with a low res case first. If so, sure. If not, this may have to wait until later in the week.

@worleyph
Contributor Author

@jayeshkrishna, as mentioned on the other issue page,

 Can you try cray-parallel-netcdf/1.6.1?

changed nothing.

@maltrud
Contributor

maltrud commented Nov 20, 2016

It looks like we need to sort this out ASAP. It is keeping me from spinning up the high res ocn/ice. I've tried a variety of processor combinations and PIO strides and nothing is helping; I always get the MPAS IO Error message and a zero-length restart file for MPAS-O. Interestingly, MPAS-CICE has no trouble writing a restart with the same number of cores as MPAS-O. Maybe not apples to apples, but perhaps it is related to the vertical dimension in the ocean?

Pat, is there any possibility of getting the system people to install pnetcdf 1.5 just so we can (hopefully) get stuff running while we figure out what's going on?

@worleyph
Contributor Author

worleyph commented Nov 20, 2016

Pat, is there any possibility of getting the system people to install pnetcdf 1.5 just so we can (hopefully) get stuff running while we figure out what's going on?

@maltrud , your requesting this (to help@olcf.ornl.gov) will have as much impact as my requesting it. You can also cc @mrnorman and @anantharajvg to see if they can lobby behind the scenes.

Note that disabling the use of MPI_RSEND, which gave me a "successful" run with full I/O, had no effect on this issue (the error messages are still there), so the two problems are not related.

@maltrud
Contributor

maltrud commented Nov 20, 2016

Thanks, @worleyph. The MPI_RSEND issue did occur to me, so I'm glad you reminded me that it's a different issue.

@worleyph
Contributor Author

Note that I have not managed to reproduce this problem with an ne30_oEC resolution A_WCYCL2000 compset. @maltrud, what compset, resolution, and PE layout are you using when you see this?

@mark-petersen
Contributor

Copying text by @vanroekel at MPAS-Dev/MPAS#1279 that is relevant to this issue:

In running an MPAS-O case with approximately 3.6 million cells (RRS18to6), our log.err file has numerous instances of the error "Bad return value from PIO", and the file written has size 0. If we use pnetcdf/1.5.0, this does not happen; the output looks reasonable and valid (verified with ncdump and by visualizing with ParaView).

After digging through the framework and comparing pnetcdf versions, it appears pnetcdf/1.5.0 works because there was a bug in that version that was remedied in later versions. In MPAS, we use NC_64BIT_OFFSET by default for output. For CDF-2 files, any single variable cannot exceed 4 GB in size. Any variable in my 18to6 run that is dimensioned nEdges by nVertLevels (e.g., normalVelocity) has a size of ~8 GB and thus violates this constraint. In pnetcdf/1.5.0, only the variable dimensions were accounted for, with no consideration for the size of an element; this allowed us to pass the size check and proceed to file writes. This was remedied in pnetcdf/1.6.0, and we can no longer write using NC_64BIT_OFFSET.

I still do not understand why I get valid output for an array that violates CDF-2 constraints, and I am communicating with the pnetcdf developers on this (see the discussion at https://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/29).

However, I think the more appropriate solution is to switch the default output to NC_64BIT_DATA (CDF-5), or at least to allow easier use of this option. From what I can tell, there is no easy way to use NC_64BIT_DATA in the framework. If I look at this block from mpas_io.F:

      if (ioContext % master_pio_iotype /= -999) then
         ! a master iotype overrides everything, but always uses CDF-2
         pio_iotype = ioContext % master_pio_iotype
         pio_mode = PIO_64BIT_OFFSET
      else
         if (ioformat == MPAS_IO_PNETCDF) then
            pio_iotype = PIO_iotype_pnetcdf
            pio_mode = PIO_64BIT_OFFSET
         else if (ioformat == MPAS_IO_PNETCDF5) then
            ! the only path to CDF-5 (64-bit data) output
            pio_iotype = PIO_iotype_pnetcdf
            pio_mode = PIO_64BIT_DATA
         else if (ioformat == MPAS_IO_NETCDF) then
            pio_iotype = PIO_iotype_netcdf
            pio_mode = PIO_64BIT_OFFSET
         else if (ioformat == MPAS_IO_NETCDF4) then
            pio_iotype = PIO_iotype_netcdf4p
            pio_mode = PIO_64BIT_OFFSET
         end if
      end if

Looking at this block, I can only get to the 64BIT_DATA option when master_pio_iotype is unset and ioformat is MPAS_IO_PNETCDF5, yet I can't seem to find where that happens; I see no calls to MPAS_io_set_iotype. Am I missing it? Or is 64BIT_OFFSET currently the only option for output? If so, is it possible to change this? 64BIT_DATA seems to work in my tests with a modified framework, but I don't know if I'm missing something else about why output is only written as CDF-2.
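
To make the ~8 GB figure above concrete, here is a rough back-of-envelope check of the CDF-2 per-variable limit. The numbers below are illustrative assumptions, not values from this issue: an MPAS mesh has roughly three edges per cell, and the vertical level count and 8-byte reals are guesses:

   program cdf2_check
      ! Back-of-envelope check of the CDF-2 (NC_64BIT_OFFSET) per-variable
      ! limit; all mesh sizes here are illustrative assumptions.
      implicit none
      integer(kind=8), parameter :: nCells = 3600000_8        ! ~3.6M cells (RRS18to6)
      integer(kind=8), parameter :: nEdges = 3_8 * nCells     ! roughly 3 edges per cell
      integer(kind=8), parameter :: nVertLevels = 100_8       ! assumed level count
      integer(kind=8), parameter :: realSize = 8_8            ! bytes per double
      integer(kind=8), parameter :: cdf2Limit = 4_8 * 1024_8**3   ! ~4 GiB per variable
      integer(kind=8) :: varBytes

      ! one (nEdges, nVertLevels) variable, e.g. normalVelocity
      varBytes = nEdges * nVertLevels * realSize
      print '(a,f6.2,a)', 'variable size: ', real(varBytes) / real(1024_8**3), ' GiB'
      print *, 'exceeds CDF-2 limit? ', varBytes > cdf2Limit   ! T for these sizes
   end program cdf2_check

With these assumed sizes the (nEdges, nVertLevels) variable comes out near 8 GiB, consistent with the ~8 GB quoted above; CDF-5 (NC_64BIT_DATA) removes this per-variable restriction.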

@worleyph
Contributor Author

worleyph commented Apr 4, 2018

@jayeshkrishna and @vanroekel , I think that this has been resolved? Please confirm (and close if so). I can run an experiment to verify if this would be useful.

@vanroekel
Contributor

@worleyph agreed this should be closed.

@vanroekel
Contributor

Fixed by #1456.
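
For anyone hitting this on an older code base, the mpas_io.F branch quoted earlier suggests the essential change: create the file with the PIO_64BIT_DATA mode so output is written as CDF-5. Below is a minimal, hypothetical sketch using PIO's Fortran API; the subroutine and argument names are invented here, and this is not the actual patch in #1456:

   ! Hypothetical sketch: create a CDF-5 output file through PIO so that
   ! variables larger than 4 GiB become legal. Assumes an already
   ! initialized PIO I/O system; only the PIO names are real API.
   subroutine open_cdf5_file(pio_iosystem, pio_file, filename, ierr)
      use pio
      type (iosystem_desc_t), intent(inout) :: pio_iosystem
      type (file_desc_t), intent(out) :: pio_file
      character (len=*), intent(in) :: filename
      integer, intent(out) :: ierr

      ! PIO_64BIT_DATA selects CDF-5 rather than the CDF-2 default
      ierr = PIO_createfile(pio_iosystem, pio_file, PIO_iotype_pnetcdf, &
                            trim(filename), PIO_64BIT_DATA)
   end subroutine open_cdf5_file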
