For cori-knl, update PE layout for F compsets at ne4 resolution. #3920
This change will make the default PE layout for ne4 F compsets more efficient by using only 3 nodes. It keeps the fast-as-possible Large (L) layout and adds a single-node Small (S) layout (for ensemble testing?).

| Nodes | SYPD | Current | With this change |
|-------|------|---------|------------------|
| 1 | 18.7 | | S |
| 3 | 32.6 | | M (default) |
| 13 | 52.1 | M (default) | L |

[bfb]
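To make the efficiency claim concrete, here is a rough sketch that converts the SYPD figures from the PR description into cost per simulated year. The S/M/L labels and SYPD values come from the description; the helper names are made up for illustration, and the calculation assumes cost scales with whole nodes for the full wall-clock time.

```python
# SYPD values copied from the PR description: label -> (nodes, SYPD).
layouts = {"S": (1, 18.7), "M": (3, 32.6), "L": (13, 52.1)}


def node_hours_per_sim_year(nodes, sypd):
    # SYPD is simulated years per wall-clock day, so one simulated year
    # takes 24 / SYPD wall-clock hours on `nodes` nodes.
    return nodes * 24.0 / sypd


def sypd_per_node(nodes, sypd):
    # Throughput delivered by each node (hypothetical helper).
    return sypd / nodes


for name, (nodes, sypd) in layouts.items():
    print(
        f"{name}: {nodes:2d} nodes, {sypd:5.1f} SYPD, "
        f"{sypd_per_node(nodes, sypd):5.2f} SYPD/node, "
        f"{node_hours_per_sim_year(nodes, sypd):5.2f} node-hours per simulated year"
    )
```

By this measure, moving the default from 13 nodes to 3 nodes cuts the cost from about 6.0 to about 2.2 node-hours per simulated year, while throughput drops from 52.1 to 32.6 SYPD.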
Merged to next.
@ndkeen: I recently tried GNU on Cori KNL. The model blew up with the following error for
My naive
Let me know if I missed anything or if there is some flag I need to add to make it work with GNU.
I don't think the root cause is the change of PE layouts. You don't have enough of the error message, but I suspect it's the same as what we had here
which happened after Cori did some software upgrades. The "fix" I ended up with was to simply alter the compiler flags, which allowed the tests to pass, but I knew it wasn't the right fix. I think that before this PR, the ne4 cases were using 1 MPI rank per column (13 nodes) and no threading. You can still get that very same layout using "L".
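As a rough check on the "1 MPI per column (13 nodes)" figure, the sketch below counts the unique columns on an ne4 cubed-sphere grid. It assumes the standard np=4 spectral-element grid, which the thread does not state, and the ranks-per-node number is inferred here rather than taken from the PR.

```python
import math


def unique_columns(ne, np_=4):
    # Unique GLL points on a cubed sphere with ne x ne elements per face
    # and np_ x np_ points per element (np_=4 is an assumption here).
    return 6 * ne**2 * (np_ - 1) ** 2 + 2


ncol = unique_columns(4)  # ne4
nodes = 13  # node count quoted in the comment above
ranks_per_node = math.ceil(ncol / nodes)
print(f"ne4 columns: {ncol}, ranks per node on {nodes} nodes: {ranks_per_node}")
```

Under that assumption, one rank per column means 866 ranks, or about 67 ranks on each of the 13 nodes.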
In my recent GNU testing, I've also run into the same issue, which shows that changing the compiler flags isn't a cure-all. I can try debugging further, but I actually don't think it's a problem in our code. Here are some examples of runs (with master of Nov 19th) that failed in this way. But note that many other tests work.
So I think we just don't yet know why GNU is not working in all cases on Cori. Note I've also tried other versions of GNU, but have not yet been able to build/run with the most recent GNU 10 (there is an issue about that as well).
Thanks for this info. I didn't do any extensive testing except for the git bisect. It might just be a fluke that it worked for me twice (master and my branch) after I reverted this commit. GNU is working fine on other machines, so it might just be a Cori-KNL issue (like you mentioned). The error message and the trace keep changing, pointing to different files each time we run into this error, but I am pasting one complete error message here for future reference:
This change will make the default PE layout for ne4 F compsets more efficient by using only 3 nodes.
It keeps the fast-as-possible Large (L) layout and adds a single-node Small (S) layout (for ensemble testing?).
Fixes #3939
[bfb]