EC30to60 performance test hanging on Chrysalis with Gnu and OpenMPI #500
Same on Chicoma in the latest testing.
I can confirm this behavior on Chicoma. In the PR test suite I see:
This one also has trouble:
It appears to hang on this line in the log file, but sometimes recovers.
Watching the log file, it takes about 10 minutes to get through reading the namelist, which should take just a few seconds. This appears to be an I/O problem. I get the same behavior by simply running the … It also hangs for several minutes here, again indicating an I/O problem:
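A 10-minute namelist read suggests the slowdown is in file access itself rather than in the model. As a rough illustration only (not MPAS code), timing a plain sequential read of the same file can confirm whether the filesystem is the bottleneck; `timed_read` is a hypothetical helper:

```python
import time

def timed_read(path):
    # Read the whole file and time it. If this alone is slow for a file
    # that should take fractions of a second, the problem is I/O, not
    # anything the model does with the contents.
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        n = len(f.read())
    return n, time.perf_counter() - t0
```

Running this on the namelist (or the cached mesh file) from the same node would separate filesystem latency from anything inside the executable.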
On Chrysalis, this simply hangs at this point in the log file:
On Perlmutter it failed and then hangs on the …
The …
and the …
@mark-petersen, do you think we just need to generate a more up-to-date cached mesh and initial condition? It seems worth a try. If that works, it would be a huge relief!
I can at least try that right now.
I ran the EC test cases without the cached mesh and initial condition, and I still get the hanging on Chrysalis with GNU and OpenMPI. I'm trying ECwISC, but I expect to find the same. So it has nothing to do with important missing variables in the initial condition, I think. Those warnings are a red herring.
Yep, same for ECwISC.
I have used …
Using print statements, I have traced the problem to: … This is very strange! It seems that an … Even more frustrating, it happens only in optimized mode. In debug mode, everything seems fine.
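For reference, the print-statement bisection described above only localizes a hang reliably if each marker is flushed, since buffered output can be lost when a run stalls. A minimal sketch of the idea in Python (the actual model code is Fortran; `trace` and the marker names are illustrative, not from the MPAS source):

```python
import sys

def trace(marker: str) -> None:
    # Emit a marker and flush immediately. Without the flush, a hang can
    # swallow buffered output and the last marker seen in the log would
    # mislocate the problem.
    print(f"TRACE: {marker}", file=sys.stderr, flush=True)

# Usage: bracket each suspect step, then rerun. The last marker that
# appears in the log identifies the step that hangs, e.g.:
#   trace("before namelist read")
#   ...suspect step...
#   trace("after namelist read")
```

The same pattern applies in Fortran via `write` to `error_unit` followed by `flush(error_unit)`.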
@mark-petersen, any thoughts? |
Just a thought: is there any data, initialization, or flag expected to be shared among threads that is missing?
@sarats, great suggestion! I had only tried with multiple threads so far. I'll try with 1 thread per core and see if the problem persists.
@sarats, I tested again without OpenMP support, but the hanging behavior remains. I also looked at the configuration I've been running, and it was already using a single thread before, so it seems unlikely to be a threading issue. Even so, thank you for the suggestion. It's good that we seem to have eliminated that particular possibility.
OK, I figured it out. It's actually the variable … I was able to reproduce the error just after the merge of E3SM-Project/E3SM#5120, but the error does not occur just before. Once I removed …
My theory on the cause does not explain why the innocent-looking E3SM-Project/E3SM#5120 would cause this. I can only say that the fix works, and compiler optimization is a finicky business. I will post a bug report and bug fix to E3SM tomorrow. |
Mark: just curious, what was the optimization level used when it hangs? Was it …
It was with O3. |
Appears indeed to be fixed by E3SM-Project/E3SM#5575. I will close this issue once that PR has been merged and the E3SM-Project submodule here has been updated. |
There is no error message, but the simulation never starts. See:
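Because the run emits no error and simply never starts, a hang is indistinguishable from a very slow run unless wall time is bounded. One generic way to surface such hangs in a test harness is a timeout wrapper; this is a minimal sketch, assuming nothing about the actual test driver (`run_with_timeout` is a hypothetical helper, not part of the testing framework used here):

```python
import subprocess

def run_with_timeout(cmd, seconds):
    # Run a command but cap its wall time. A hung run then raises
    # subprocess.TimeoutExpired instead of blocking the suite forever,
    # which also distinguishes "hangs" from "fails with an error".
    return subprocess.run(cmd, timeout=seconds, capture_output=True)
```

A harness could then report a step that exceeds its budget as "hung" rather than leaving the job to sit until the scheduler kills it.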