
MPAS-O PIO warnings during restart write #1138

Closed
worleyph opened this issue Nov 13, 2016 · 14 comments


@worleyph
Contributor

In high resolution coupled runs (-compset ACME_WCYCL2000 -res ne120_oRRS15) on Titan with restart writes enabled (REST_OPTION), warning messages are being generated at the end of the run of the form:

  MPAS IO Error: Bad return value from PIO

For a run on 120,000 cores, with 32768 processes for MPAS-O, 544 warnings are output. For a run on 60,000 cores, with 16384 processes for MPAS-O, there are 267 warnings. These two runs had fixed strides, so perhaps the number of warnings is related to the number of PIO processes in the ocean (512 and 256, respectively)?

The warnings do not appear to prevent the run from completing normally.

The ocn.log output immediately surrounding the warnings includes (for a 1-day run, with restart):

 ...
  Completed timestep 0001-01-02_00:00:00
  Writing restart streams
     Writing stream restart
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
  MPAS IO Error: Bad return value from PIO
     Finished writing stream restart
     Writing stream timeFiltersRestart
     Finished writing stream timeFiltersRestart
     Writing stream eliassenPalmRestart
     Finished writing stream eliassenPalmRestart
     Writing stream timeSeriesStatsRestart
  MPAS IO Error: Bad return value from PIO
 ...
  MPAS IO Error: Bad return value from PIO
     Finished writing stream timeSeriesStatsRestart
  Finished writing restart streams
  Exporting ocean state
  Finished exporting ocean state
        Finalizing AM globalStats
        Finalizing AM layerVolumeWeightedAverage
        Finalizing AM meridionalHeatTransport
        Finalizing AM surfaceAreaWeightedAverages
        Finalizing AM mixedLayerDepths
        Finalizing AM timeSeriesStats

The same thing occurs at the end of a month-long run with restarts enabled. There are no warnings if REST_OPTION is set to never.

I have not looked at low resolution cases, nor at other systems. It should be simple to scan the ocean log files for any production jobs (using restarts) to see if similar warning messages show up there.

@vanroekel
Contributor

@worleyph I have seen this same issue on LANL machines running MPAS-O only. When these errors occur, the model appears to finish, but the output files associated with the "Bad return value from PIO" errors are zero bytes in size. I have been able to circumvent these issues by using parallel-netcdf 1.5.0 instead of 1.6.0 or higher (1.6.0 appears to be the default on Mira). Unfortunately, I don't understand why this error occurs.

@mark-petersen
Contributor

@vanroekel Thanks for your input here. I also have seen this output message, but have not pursued it. We use parallel-netcdf/1.5.0 in our standard module configuration for MPAS-O. Obviously, we need to be able to run with higher versions in the long term. We just have not looked at it more carefully.

@worleyph
Contributor Author

On Titan we are currently using

 module add cray-parallel-netcdf/1.7.0

@jayeshkrishna
Contributor

Can you try cray-parallel-netcdf/1.6.1?

@worleyph
Contributor Author

I'll see if I can get a reproducer with a low res case first. If so, sure. If not, this may have to wait until later in the week.

@worleyph
Contributor Author

@jayeshkrishna, as mentioned on the other issue page,

 Can you try cray-parallel-netcdf/1.6.1?

changed nothing.

@maltrud
Contributor

maltrud commented Nov 20, 2016

It looks like we need to sort this out ASAP. It is keeping me from spinning up the high res ocn/ice. I've tried a variety of processor combinations and PIO strides and nothing is helping; I always get the MPAS IO Error message and a zero-length restart file for MPAS-O. Interestingly, MPAS-CICE has no trouble writing a restart with the same number of cores as MPAS-O. Maybe not apples to apples, but perhaps it is related to the vertical dimension in the ocean?

Pat, is there any possibility of getting the system people to install pnetcdf 1.5 just so we can (hopefully) get stuff running while we figure out what's going on?

@worleyph
Contributor Author

worleyph commented Nov 20, 2016

Pat, is there any possibility of getting the system people to install pnetcdf 1.5 just so we can (hopefully) get stuff running while we figure out what's going on?

@maltrud , your requesting this (to help@olcf.ornl.gov) will have as much impact as my requesting it. You can also cc @mrnorman and @anantharajvg to see if they can lobby behind the scenes.

Note that disabling the use of MPI_RSEND, which gave me a "successful" run with full I/O, had no effect on this issue (the error messages are still there), so the two problems are not related.

@maltrud
Contributor

maltrud commented Nov 20, 2016

Thanks, @worleyph. The MPI_RSEND issue did occur to me, so I'm glad you reminded me that it's a different issue.

@worleyph
Contributor Author

Note that I have not managed to reproduce this problem with an ne30_oEC resolution A_WCYCL2000 compset. @maltrud, what compset, resolution, and PE layout are you using when you see this?

@mark-petersen
Contributor

Copying text by @vanroekel at MPAS-Dev/MPAS#1279 that is relevant to this issue:

In running an MPAS-O case with approximately 3.6 million cells (RRS18to6), our log.err file has numerous instances of the error "Bad return value from PIO", and the file written has size 0. If we use pnetcdf/1.5.0, this does not happen; the output looks reasonable and valid (verified with ncdump and by visualizing with ParaView).

After digging through the framework and comparing pnetcdf versions, it appears pnetcdf/1.5.0 works because there was a bug in that version that was remedied in later versions. In MPAS, we use NC_64BIT_OFFSET by default for output. For CDF-2 files, any single variable cannot exceed 4 GB in size. Any variable in my 18to6 run that is dimensioned nEdges by nVertLevels (e.g., normalVelocity) has a size of ~8 GB and thus violates this constraint. In pnetcdf/1.5.0, only the variable dimensions were accounted for, with no consideration for the size of an element; this allowed us to pass the size check and proceed to file writes. This was remedied in pnetcdf/1.6.0, and we can no longer write using NC_64BIT_OFFSET.

I still do not understand why I get valid output for an array that violates CDF-2 constraints, and I am communicating with the pnetcdf developers on this (see the discussion at https://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/29).

However, I think the more appropriate solution is to switch the default output to NC_64BIT_DATA (CDF-5), or at least to allow easier use of this option. From what I can tell, there is no easy way to use NC_64BIT_DATA in the framework. If I look at this block from mpas_io.F:

      if (ioContext % master_pio_iotype /= -999) then
         ! a master iotype overrides everything, but always uses CDF-2
         pio_iotype = ioContext % master_pio_iotype
         pio_mode = PIO_64BIT_OFFSET
      else
         if (ioformat == MPAS_IO_PNETCDF) then
            pio_iotype = PIO_iotype_pnetcdf
            pio_mode = PIO_64BIT_OFFSET
         else if (ioformat == MPAS_IO_PNETCDF5) then
            ! the only path to CDF-5 (64-bit data) output
            pio_iotype = PIO_iotype_pnetcdf
            pio_mode = PIO_64BIT_DATA
         else if (ioformat == MPAS_IO_NETCDF) then
            pio_iotype = PIO_iotype_netcdf
            pio_mode = PIO_64BIT_OFFSET
         else if (ioformat == MPAS_IO_NETCDF4) then
            pio_iotype = PIO_iotype_netcdf4p
            pio_mode = PIO_64BIT_OFFSET
         end if
      end if

Looking at this block, I can only get to the 64BIT_DATA option when master_pio_iotype is unset and ioformat is MPAS_IO_PNETCDF5, yet I can't seem to find where that happens; I see no calls to MPAS_io_set_iotype. Am I missing it? Or is 64BIT_OFFSET currently the only option for output? If so, is it possible to change this? 64BIT_DATA seems to work in my tests with a modified framework, but I don't know if I'm missing something else about why output is only written as CDF-2.
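
To make the ~8 GB figure above concrete, here is a rough back-of-envelope check of the CDF-2 per-variable limit. The numbers below are illustrative assumptions, not values from this issue: an MPAS mesh has roughly three edges per cell, and the vertical level count and 8-byte reals are guesses:

   program cdf2_check
      ! Back-of-envelope check of the CDF-2 (NC_64BIT_OFFSET) per-variable
      ! limit; all mesh sizes here are illustrative assumptions.
      implicit none
      integer(kind=8), parameter :: nCells = 3600000_8        ! ~3.6M cells (RRS18to6)
      integer(kind=8), parameter :: nEdges = 3_8 * nCells     ! roughly 3 edges per cell
      integer(kind=8), parameter :: nVertLevels = 100_8       ! assumed level count
      integer(kind=8), parameter :: realSize = 8_8            ! bytes per double
      integer(kind=8), parameter :: cdf2Limit = 4_8 * 1024_8**3   ! ~4 GiB per variable
      integer(kind=8) :: varBytes

      ! one (nEdges, nVertLevels) variable, e.g. normalVelocity
      varBytes = nEdges * nVertLevels * realSize
      print '(a,f6.2,a)', 'variable size: ', real(varBytes) / real(1024_8**3), ' GiB'
      print *, 'exceeds CDF-2 limit? ', varBytes > cdf2Limit   ! T for these sizes
   end program cdf2_check

With these assumed sizes the (nEdges, nVertLevels) variable comes out near 8 GiB, consistent with the ~8 GB quoted above; CDF-5 (NC_64BIT_DATA) removes this per-variable restriction.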

@worleyph
Contributor Author

worleyph commented Apr 4, 2018

@jayeshkrishna and @vanroekel , I think that this has been resolved? Please confirm (and close if so). I can run an experiment to verify if this would be useful.

@vanroekel
Contributor

@worleyph agreed this should be closed.

@vanroekel
Contributor

Fixed by #1456.
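
For anyone hitting this on an older code base, the mpas_io.F branch quoted earlier suggests the essential change: create the file with the PIO_64BIT_DATA mode so output is written as CDF-5. Below is a minimal, hypothetical sketch using PIO's Fortran API; the subroutine and argument names are invented here, and this is not the actual patch in #1456:

   ! Hypothetical sketch: create a CDF-5 output file through PIO so that
   ! variables larger than 4 GiB become legal. Assumes an already
   ! initialized PIO I/O system; only the PIO names are real API.
   subroutine open_cdf5_file(pio_iosystem, pio_file, filename, ierr)
      use pio
      type (iosystem_desc_t), intent(inout) :: pio_iosystem
      type (file_desc_t), intent(out) :: pio_file
      character (len=*), intent(in) :: filename
      integer, intent(out) :: ierr

      ! PIO_64BIT_DATA selects CDF-5 rather than the CDF-2 default
      ierr = PIO_createfile(pio_iosystem, pio_file, PIO_iotype_pnetcdf, &
                            trim(filename), PIO_64BIT_DATA)
   end subroutine open_cdf5_file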
