
EC30to60 performance test hanging on Chrysalis with Gnu and OpenMPI #500

Closed

xylar opened this issue Jan 12, 2023 · 17 comments · Fixed by #606 or #624
Labels: bug, ocean

Comments

@xylar (Collaborator) commented Jan 12, 2023

There is no error message, but the simulation never starts. See:

/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20230111/ocean_pr_intel_gnu/ocean/global_ocean/EC30to60/PHC/performance_test/forward
@xylar added the bug label Jan 12, 2023
@xylar changed the title from "EC30to60 hanging on Chrysalis with Gnu and OpenMPI" to "EC30to60 performance test hanging on Chrysalis with Gnu and OpenMPI" Jan 12, 2023
@xylar added the ocean label Jan 12, 2023
@xylar mentioned this issue Mar 9, 2023
@xylar (Collaborator, Author) commented Mar 9, 2023

Same on Chicoma in the latest testing.

@mark-petersen (Collaborator) commented

I can confirm this behavior on Chicoma. In the PR test suite I see:

00:00 PASS ocean_global_ocean_EC30to60_mesh
00:00 PASS ocean_global_ocean_EC30to60_PHC_init
115:36 FAIL ocean_global_ocean_EC30to60_PHC_performance_test

This one also has trouble:

ocean/isomip_plus/planar/2km/z-star/Ocean0
  * step: process_geom
  * step: planar_mesh
  * step: cull_mesh
  * step: initial_state
  * step: ssh_adjustment

It appears to hang on this line in the log file, but sometimes recovers.

 Reading namelist from file namelist.ocean

Watching the log file, it takes about 10 minutes to get through reading the namelist, which should take just a few seconds. This appears to be an I/O problem. I get the same behavior by simply running the srun command directly, so this is unrelated to any compass interface.
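A minimal sketch of such a direct launch outside of compass (the task count is an assumption; namelist.ocean and streams.ocean are the ones already in the step's work directory):

# run from the hanging step's work directory
srun -n 64 ./ocean_model
tail -f log.ocean.0000.out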

It also hangs for several minutes here, again indicating an I/O problem:

  ** Attempting to bootstrap MPAS framework using stream: mesh
 Bootstrapping framework with mesh fields from input file 'adjusting_init.nc'

@mark-petersen (Collaborator) commented

On Chrysalis, this simply hangs at this point in the log file:

ocean/global_ocean/EC30to60/PHC/performance_test
  * step: forward

pwd
/lcrc/group/e3sm/ac.mpetersen/scratch/runs/n/ocean_model_230322_c9201a4f_ch_gfortran_openmp_test_compass_EC/ocean/global_ocean/EC30to60/PHC/performance_test/forward

(dev_compass_1.2.0-alpha.5) chr:forward$ tail -f log.ocean.0000.out
WARNING: Variable avgTotalFreshWaterTemperatureFlux not in input file.
WARNING: Variable tidalPotentialEta not in input file.
WARNING: Variable nTidalPotentialConstituents not in input file.

On Perlmutter, it failed and then hangs on ECwISC30to60:

pm:ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC$ cr
ocean/global_ocean/EC30to60/mesh
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/EC30to60/PHC/init
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/EC30to60/PHC/performance_test
  * step: forward
      Failed
  test execution:      ERROR
  see: case_outputs/ocean_global_ocean_EC30to60_PHC_performance_test.log
  test runtime:        00:10
ocean/global_ocean/ECwISC30to60/mesh
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/ECwISC30to60/PHC/init
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/ECwISC30to60/PHC/performance_test
  * step: forward

The ocean/global_ocean/EC30to60/PHC/performance_test ends here in the log file:

pwd
/pscratch/sd/m/mpeterse/runs/n/ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC/ocean/global_ocean/EC30to60/PHC/performance_test/forward

WARNING: Variable filteredSSHGradientMeridional not in input file.
WARNING: Variable avgTotalFreshWaterTemperatureFlux not in input file.
WARNING: Variable tidalPotentialEta not in input file.
WARNING: Variable nTidalPotentialConstituents not in input file.
WARNING: Variable RediKappaData not in input file.

and the ocean/global_ocean/ECwISC30to60/PHC/performance_test hangs here in the log file:

pwd
/pscratch/sd/m/mpeterse/runs/n/ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC/ocean/global_ocean/ECwISC30to60/PHC/performance_test/forward

tail -n 5 log.ocean.0000.out
WARNING: Variable landIceDraft not in input file.
WARNING: Variable landIceFreshwaterFlux not in input file.
WARNING: Variable landIceHeatFlux not in input file.
WARNING: Variable heatFluxToLandIce not in input file.
WARNING: Variable tidalPotentialEta not in input file.

@xylar (Collaborator, Author) commented Mar 22, 2023

@mark-petersen, do you think we just need to generate a more up-to-date cached mesh and initial condition? It seems worth a try. If that works, it would be a huge relief!

@xylar (Collaborator, Author) commented Mar 22, 2023

I can at least try that right now.

@xylar (Collaborator, Author) commented Mar 22, 2023

I ran the EC test cases without the cached mesh and init, and I still get the hang on Chrysalis with gnu and OpenMPI. I'm trying ECwISC but expect to find the same. So I think this has nothing to do with missing variables in the initial condition; those warnings are a red herring.

@xylar (Collaborator, Author) commented Mar 22, 2023

Yep, same for ECwISC.

@xylar (Collaborator, Author) commented Mar 26, 2023

I used git bisect, together with a timeout added to the model run call, to trace this back to E3SM-Project/E3SM#5120.
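Roughly, the bisect workflow looks like this (a sketch, not the exact commands used; the known-good commit, task count, and timeout length are assumptions):

cd E3SM
git bisect start
git bisect bad HEAD                 # hang present here
git bisect good <known-good-commit> # hang absent here
# at each bisect step: rebuild the model, then run the forward step under
# a timeout so a hang registers as a failure instead of blocking the bisect
timeout 10m srun -n 64 ./ocean_model && git bisect good || git bisect bad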

@xylar (Collaborator, Author) commented Mar 26, 2023

Using print statements, I have traced the problem to:
https://github.com/E3SM-Project/E3SM/blob/master/components/mpas-framework/src/framework/add_field_indices.inc#L33
and
https://github.com/E3SM-Project/E3SM/blob/master/components/mpas-framework/src/framework/mpas_dmpar.F#L745
for the variable RediKappaData.

This is very strange! It seems that an MPI_Allreduce is hanging. I don't see any changes in E3SM-Project/E3SM#5120 that explain this.

Even more frustrating, it happens only in optimized mode. In debug mode, everything seems fine.
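As an aside, an alternative to print statements for locating a hang like this is to attach a debugger to one of the stuck ranks and take a backtrace (a sketch; it assumes gdb is available on the compute node and that the executable is named ocean_model):

# find one of the hung MPI ranks and dump where every thread is blocked
pid=$(pgrep -f ocean_model | head -n 1)
gdb -p "$pid" -batch -ex "thread apply all bt"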

@xylar (Collaborator, Author) commented Mar 26, 2023

@mark-petersen, any thoughts?

@sarats commented Mar 30, 2023

Just a thought: is there any data, initialization, or flag that is expected to be shared among threads but is missing?
It looks like the MPI_Allreduce is issued by thread 0. To rule out a threading-related issue, you can try running this in pure MPI mode.

https://github.com/E3SM-Project/E3SM/blob/4deb2611a4293fdb578db5dd1ba9fd7a6c223029/components/mpas-framework/src/framework/mpas_dmpar.F#L743
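One way to force the pure-MPI configuration for such a test (a sketch; the task count and SLURM flags are assumptions that depend on the machine):

# one rank per core, no OpenMP threading
export OMP_NUM_THREADS=1
srun -n 64 --cpus-per-task=1 ./ocean_model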

@xylar (Collaborator, Author) commented Mar 31, 2023

@sarats, great suggestion! I had only tried with multiple threads so far. I'll try with 1 thread per core and see if the problem persists.

@xylar (Collaborator, Author) commented Mar 31, 2023

@sarats, I tested again without OpenMP support, but the hanging behavior remains.

I also looked at the configuration I've been running, and it was already using a single thread, so it seems unlikely to be a threading issue. Even so, thank you for the suggestion. It's good that we seem to have eliminated that particular possibility.

@mark-petersen (Collaborator) commented

OK, I figured it out. It's actually the variable RediKappaData that is causing the problem. That variable is declared in the Registry but is never actually used (a quick check for this is sketched below). So I'm guessing that the compiler, in its optimizing exuberance, got rid of some underlying information about the array, and then the MPI communication hangs when it communicates the size of the array.
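A quick way to confirm that a Registry variable is never referenced in the model source (a sketch; the directory layout is an assumption based on the E3SM repository):

cd components/mpas-ocean/src
grep -rn "RediKappaData" .
# if the only hits are in Registry.xml and the generated framework code,
# the variable is declared but never used by the model itself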

I was able to reproduce the error just after the merge of E3SM-Project/E3SM#5120, but the error does not occur just before it. Once I removed RediKappaData, everything works fine, without the hang. I was testing on Chrysalis with the EC30to60 performance tests, here:

/lcrc/group/e3sm/ac.mpetersen/scratch/runs/ocean_model_230404_c5f8b378_ch_gfortran_openmp_after_5120/ocean/global_ocean/EC30to60/PHC/performance_test/forward

My theory on the cause does not explain why the innocent-looking E3SM-Project/E3SM#5120 would cause this. I can only say that the fix works, and compiler optimization is a finicky business.

I will post a bug report and bug fix to E3SM tomorrow.

@sarats commented Apr 5, 2023

"optimizing exuberance"

Mark: just curious, what optimization level was used when it hangs? Was it -O3 or even lower?

@mark-petersen (Collaborator) commented

It was with -O3.

@xylar (Collaborator, Author) commented Apr 12, 2023

Appears indeed to be fixed by E3SM-Project/E3SM#5575. I will close this issue once that PR has been merged and the E3SM-Project submodule here has been updated.

@xylar mentioned this issue Apr 29, 2023