Failure on next with SMS_D_Ln5.ne16_ne16.F1850C5AV1C-04.cori-knl_intel #1548
Comments
@noel: I would think that this bug would cause all _D (debug) cases to fail. Is that the case, or is just this test failing?
Yes, it looks like all of the SMS_D cases are failing (that was just the first to report). And I can now verify that all SMS_D cases on edison are failing with next as well.
This is probably coming from the new code in mct/m_Rearranger.F90. CC'ing @worleyph. Why doesn't the trace continue into MCT? That's also compiled with debugging in _D runs.
@rljacob, I can't look at this until Sunday evening at the earliest.
@rljacob, can you do a grep for RSENDBUF (and rsendbuf)? I don't think that this is something I added.
I'm still looking. It's in a call to m_swapm_FP, but I don't see how it's not allocated.
There was an issue (some compiler, some system) in which a loop with bounds "1,0" still got executed once. Would this generate this error? I had to surround the loop with an if test, though this is still a Fortran compiler bug? We should be able to add some code to check this.
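For reference, the guard mentioned in the previous comment can be sketched as below. This is only an illustration, not code from MCT: the program name and the variables `n` and `count` are hypothetical. In standard Fortran a DO loop with bounds 1,0 executes zero times, so the explicit if test only protects against a compiler that gets this wrong.

```fortran
program zero_trip_guard_sketch
  implicit none
  integer :: i, n, count   ! hypothetical names, for illustration only

  n = 0        ! upper bound of the loop, giving bounds "1,0"
  count = 0    ! number of iterations actually executed

  ! Standard Fortran runs a loop with bounds 1,0 zero times. The explicit
  ! guard makes the zero-trip case independent of how the compiler
  ! handles the loop itself.
  if (n >= 1) then
     do i = 1, n
        count = count + 1
     end do
  end if

  print *, 'iterations executed: ', count
end program zero_trip_guard_sketch
```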
I also ran the tests with GNU on cori and they all passed (except for HOMME, which has been having trouble with GNU for a while).
Yes, all the next testing with GNU came back fine. Only Intel with debugging catches this.
Intel with debugging on Anvil had a much better error message:
That line is the new m_swapm_FP call, but I'm still not sure why it thinks it isn't allocated. It has the same conditional as the one around the MPI_Alltoallv call, which never complained.
@apcraig what are these mct_sMat_avMult calls in shr_strdata_advance supposed to do? Is there any way they could be operating on Av's with a local length of 0?
Found the problem: the logic around the RSENDBUF allocation is
while the associated swap call (and MPI_Alltoallv) just have "if (numr .ge. 1)". A solution is to add an else clause to the allocate for the case where SendRout%nprocs = 0.
I had to do the same thing for RRecvBuf. I will add this to the branch locally and re-merge to next.
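The else-clause fix described above can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the actual code in m_Rearranger.F90: the variables `nprocs` and `numr`, the buffer size, and the `swap_stub` routine are hypothetical stand-ins for SendRout%nprocs, the swap-call condition, and the m_swapm_FP call.

```fortran
program zero_length_alloc_sketch
  implicit none
  ! Hypothetical stand-ins: nprocs mimics SendRout%nprocs, numr mimics
  ! the count tested around the swap call.
  integer :: nprocs, numr
  real, allocatable :: rsendbuf(:)

  nprocs = 0   ! the edge case: no processes to send to
  numr   = 1   ! the swap call is still made

  if (nprocs > 0) then
     allocate(rsendbuf(nprocs*4))   ! size is illustrative only
  else
     allocate(rsendbuf(0))          ! zero-length, but allocated
  end if

  ! The buffer is now allocated on both branches, so passing it as an
  ! actual argument no longer trips the debug-mode allocation check.
  if (numr >= 1) then
     call swap_stub(rsendbuf)
  end if

  deallocate(rsendbuf)

contains

  ! Hypothetical placeholder for the real swap routine.
  subroutine swap_stub(buf)
    real, intent(in) :: buf(:)
    print *, 'buffer size = ', size(buf)
  end subroutine swap_stub

end program zero_length_alloc_sketch
```

Allocating a zero-length array is legal Fortran, so the same pattern would apply to the receive buffer.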
As I mentioned on PR #1452, it might be cleanest to just remove the 'if(SendRout%nprocs > 0)' etc. tests. The result is that zero-length arrays will be allocated/deallocated in these edge cases, and everything should work fine?
Fixed by adding more commits in #1452.
I know this next brings in MCT 2 files. I ran acme_developer on cori-knl (and edison) and found this.
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/n24may19/SMS_D_Ln5.ne16_ne16.F1850C5AV1C-04.cori-knl_intel.20170519_145235_rxtwp4