Failure on next with SMS_D_Ln5.ne16_ne16.F1850C5AV1C-04.cori-knl_intel #1548

Closed
ndkeen opened this issue May 19, 2017 · 14 comments

@ndkeen
Contributor

ndkeen commented May 19, 2017

I know this next brings in MCT 2 files. I ran acme_developer on cori-knl (and edison) and found this:

000: forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable RSENDBUF when it is not allocated
000: 
000: Image              PC                Routine            Line        Source             
000: acme.exe           000000000B16B456  Unknown               Unknown  Unknown
000: acme.exe (deleted  0000000008C47169  Unknown               Unknown  Unknown
000: acme.exe (deleted  0000000008C0DC1B  Unknown               Unknown  Unknown
000: acme.exe (deleted  00000000087A0434  Unknown               Unknown  Unknown
000: acme.exe (deleted  000000000834CCD3  Unknown               Unknown  Unknown
000: acme.exe (deleted  000000000834A1FE  Unknown               Unknown  Unknown
000: acme.exe (deleted  000000000833F5DD  Unknown               Unknown  Unknown
000: acme.exe           0000000000450F6D  component_mod_mp_         227  component_mod.F90
000: acme.exe           000000000041C451  Unknown               Unknown  Unknown
000: acme.exe           0000000000447E26  Unknown               Unknown  Unknown
000: acme.exe (deleted  000000000040AFDE  Unknown               Unknown  Unknown
000: acme.exe (deleted  000000000B273120  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/n24may19/SMS_D_Ln5.ne16_ne16.F1850C5AV1C-04.cori-knl_intel.20170519_145235_rxtwp4

@singhbalwinder
Contributor

@noel: I would think that this bug would make all _D (debug) cases fail. Is that the case, or is just this test failing?

@ndkeen
Contributor Author

ndkeen commented May 20, 2017

Yes, it looks like all of the SMS_D cases are failing (that was just the first to report). I can now verify that all SMS_D cases on edison are failing with next as well.

@rljacob
Member

rljacob commented May 20, 2017

This is probably coming from the new code in mct/m_Rearranger.F90. CC'ing @worleyph. Why doesn't the trace continue into MCT? That's also compiled with debugging in _D runs.

@rljacob self-assigned this May 20, 2017

@worleyph
Contributor

@rljacob , I can't look at this until Sunday evening at the earliest.

@worleyph
Contributor

@rljacob , can you do a grep for RSENDBUF (and rsendbuf)? I don't think this is something I added.
It almost sounds like something related to mpi_rsend, which we tried to disable?

@rljacob
Member

rljacob commented May 20, 2017

CSI0350919:mct jacob$ grep -i RSENDBUF *
m_Rearranger.F90:   real(FP),dimension(:),allocatable :: RSendBuf
m_Rearranger.F90:	allocate(RSendBuf(RSendSize),stat=ier)
m_Rearranger.F90:	if(ier/=0) call die(myname_,'allocate(RSendBuf)',ier)
m_Rearranger.F90:              RSendBuf(RSendLoc(proc)+k) = SourceAV%rAttr(AttrIndex,VectIndex)
m_Rearranger.F90:		 RSendBuf(RSendLoc(proc)+k) = SourceAV%rAttr(AttrIndex,VectIndex)
m_Rearranger.F90:	   call MPI_ISEND(RSendBuf(RSendLoc(proc)),                 &
m_Rearranger.F90:                      RSendBuf, RSendSize, RSendCnts, RSdispls, RTypes, &
m_Rearranger.F90:     call MPI_Alltoallv(RSendBuf, RSendCnts, RSdispls, mp_Type_rp, &
m_Rearranger.F90:	deallocate(RSendBuf,stat=ier)
m_Rearranger.F90:	if(ier/=0) call die(myname_,'deallocate(RSendBuf)',ier)

I'm still looking. It's in a call to m_swapm_FP, but I don't see how it's not allocated.

@worleyph
Contributor

There was an issue (some compiler, some system) in which a loop with bounds "1,0" still got executed once. Could that generate this error? We had to surround the loop with an if test, though that would still be a Fortran compiler bug? We should be able to add some code to check this.
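
A check for the zero-trip loop behavior described above could look something like the following. This is only a standalone sketch, not code from MCT; the program name and the `trips` counter are made up for illustration. Standard Fortran (77 and later) should execute a loop with bounds "1,0" zero times, so any nonzero count would point at the compiler or its flags.

```fortran
program zero_trip_check
  implicit none
  integer :: i, trips

  ! Hypothetical counter: number of times the loop body actually runs.
  trips = 0

  ! With a lower bound above the upper bound and the default step of +1,
  ! a standard-conforming compiler executes this body zero times.
  do i = 1, 0
     trips = trips + 1
  end do

  if (trips /= 0) then
     print *, 'loop with bounds 1,0 executed ', trips, ' time(s)'
     error stop 1
  end if

  print *, 'zero-trip loop behaves as expected'
end program zero_trip_check
```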

@ndkeen
Contributor Author

ndkeen commented May 20, 2017

I also ran the tests with GNU on cori and they all passed (except for HOMME, which has been having trouble with GNU for a while).

@rljacob
Member

rljacob commented May 21, 2017

Yes, all the next testing with GNU came back fine. Only Intel with debugging catches this.

@rljacob
Member

rljacob commented May 21, 2017

Intel with debugging on Anvil had a much better error message:

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable RSENDBUF when it is not allocated

Image              PC                Routine            Line        Source     
acme.exe           0000000007EA8186  Unknown               Unknown  Unknown
acme.exe           0000000007CFD2C9  m_rearranger_mp_r        1165  m_Rearranger.F90
acme.exe           0000000007CC1FDE  m_matattrvectmul_         514  m_MatAttrVectMul.F90
acme.exe           00000000078721AE  shr_strdata_mod_m         736  shr_strdata_mod.F90
acme.exe           00000000074342CF  docn_comp_mod_mp_         652  docn_comp_mod.F90
acme.exe           000000000743185A  docn_comp_mod_mp_         528  docn_comp_mod.F90
acme.exe           0000000007428201  ocn_comp_mct_mp_o          61  ocn_comp_mct.F90
acme.exe           000000000045BF39  component_mod_mp_         227  component_mod.F90
acme.exe           000000000042750E  cesm_comp_mod_mp_        1191  cesm_comp_mod.F90
acme.exe           0000000000452E08  MAIN__                     63  cesm_driver.F90
acme.exe           000000000041611E  Unknown               Unknown  Unknown
libc-2.12.so       00002B7DCD30ED1D  __libc_start_main     Unknown  Unknown
acme.exe           0000000000415FA9  Unknown               Unknown  Unknown

That line is the new m_swapm_FP call, but I'm still not sure why it thinks RSendBuf isn't allocated. The call has the same conditional as the one around the MPI_Alltoallv call, which never complained.

@rljacob
Member

rljacob commented May 22, 2017

@apcraig what are these mct_sMat_avMult calls in shr_strdata_advance supposed to do? Is there any way they could be operating on Av's with a local length of 0?

@rljacob
Member

rljacob commented May 22, 2017

Found the problem. The logic around the RSENDBUF allocation is:

  if(SendRout%nprocs > 0) then
   if (numr .ge. 1) then
      allocate(RSendBuf....)

The associated swap call (and MPI_Alltoallv), however, is guarded only by "if (numr .ge. 1)".

A solution is to add an else clause to the allocation logic for the case where SendRout%nprocs = 0:

  else
! the m_swap call needs these allocated even if
! SendRout%nprocs = 0. 
    if (useswapm) then
     if(numi .ge. 1)  allocate(ISendBuf(1),stat=ier)
     if(numr .ge. 1)  allocate(RSendBuf(1),stat=ier)
    endif
  endif

I had to do the same thing for RRecvBuf. I will add this to the branch locally and re-merge to next.
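
The guard mismatch and the else-clause fix can be sketched in a standalone program like the one below. This is an illustration under stated assumptions, not the actual m_Rearranger.F90 code: `nprocs` stands in for SendRout%nprocs, `fake_swap` is a hypothetical placeholder for the real swap call, and the buffer size of 4 is arbitrary. Without the else branch, an Intel debug build would be expected to stop at the call with the "not allocated" error from the original report.

```fortran
program guard_mismatch
  implicit none
  real, allocatable :: RSendBuf(:)
  integer :: nprocs, numr, ier
  logical :: useswapm

  nprocs   = 0       ! stand-in for SendRout%nprocs: this rank sends to no one
  numr     = 1       ! at least one real attribute is being rearranged
  useswapm = .true.

  ! Allocation is guarded by BOTH nprocs and numr ...
  if (nprocs > 0) then
     if (numr >= 1) then
        allocate(RSendBuf(4), stat=ier)   ! arbitrary size for the sketch
        if (ier /= 0) error stop 'allocate(RSendBuf) failed'
     end if
  else
     ! Fix from the thread: the swap call needs the buffer allocated
     ! even when nprocs = 0, so allocate a dummy one-element buffer.
     if (useswapm) then
        if (numr >= 1) then
           allocate(RSendBuf(1), stat=ier)
           if (ier /= 0) error stop 'allocate(RSendBuf) failed'
        end if
     end if
  end if

  ! ... while the use is guarded by numr alone.
  if (numr >= 1) then
     if (.not. allocated(RSendBuf)) error stop 'RSendBuf not allocated'
     call fake_swap(RSendBuf)
  end if

  if (allocated(RSendBuf)) deallocate(RSendBuf)

contains

  ! Hypothetical placeholder for the real swap routine.
  subroutine fake_swap(buf)
    real, intent(in) :: buf(:)
    print *, 'swap called with buffer of size ', size(buf)
  end subroutine fake_swap

end program guard_mismatch
```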

@worleyph
Contributor

As I mentioned on PR #1452, it might be cleanest to just remove the 'if(SendRout%nprocs > 0)' etc. tests. The result is that zero-length arrays will be allocated/deallocated in these edge cases, and everything should work fine?
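
The zero-length alternative relies on the fact that standard Fortran permits allocating an array of size 0; such an array still counts as allocated. A minimal sketch, with a hypothetical buffer name `buf` and size variable `n` rather than the real MCT variables:

```fortran
program zero_length_alloc
  implicit none
  real, allocatable :: buf(:)
  integer :: n, ier

  n = 0   ! edge case: nothing to send

  ! Allocating with size 0 is legal; the array is allocated but empty.
  allocate(buf(n), stat=ier)
  if (ier /= 0) error stop 'allocate(buf) failed'

  ! Expected here: allocated(buf) is true and size(buf) is 0.
  print *, 'allocated: ', allocated(buf), ' size: ', size(buf)

  deallocate(buf, stat=ier)
  if (ier /= 0) error stop 'deallocate(buf) failed'
end program zero_length_alloc
```

With this approach the allocate and deallocate calls need no nprocs test at all, so they cannot drift out of sync with the guards around the communication calls.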

@rljacob
Member

rljacob commented May 23, 2017

Fixed by adding more commits in #1452.

@rljacob closed this as completed May 23, 2017

jgfouca added a commit that referenced this issue Jun 2, 2017
…hecking

Improve user compset error checking