Uninitialized data in FillPatchTwoLevels #1147

Closed
maximumcats opened this issue Jul 25, 2020 · 9 comments

@maximumcats
Member

I have hit a case in wdmerger while regridding where FillPatchTwoLevels leaves some data uninitialized in the coarse MultiFab, resulting in bad data accesses during the interpolation onto the fine MultiFab (or an FPE when running in debug mode with signaling NaNs enabled). As one example, in a run with amr.max_level = 3, the BoxArray on the coarse grid (level 2) is

(BoxArray maxbox(12)
       m_ref->m_hash_sig(0)
       ((0,13696) (191,13759) (0,0)) ((320,13696) (383,13759) (0,0)) ((1664,14464) (1727,14655) (0,0)) ((1728,18496) (1791,18559) (0,0)) ((768,13696) (1215,13759) (0,0)) ((832,13760) (1215,13823) (0,0)) ((384,13696) (767,13759) (0,0)) ((1088,14080) (1215,14143) (0,0)) ((1088,14016) (1215,14079) (0,0)) ((832,13824) (1215,14015) (0,0)) ((1664,18240) (1727,18303) (0,0)) ((896,18752) (1215,18943) (0,0)) )

and a NaN is found in box ((768,13696) (1215,13759) (0,0)) in zone (1024, 13696).

maximumcats self-assigned this Jul 25, 2020
@maximumcats
Member Author

maximumcats commented Jul 25, 2020

With fab.init_snan=1 set in a debug build, I can add a check on mf.contains_nan() at the end of FillPatchSingleLevel and it detects a NaN, which I believe should not happen. The call to FillPatchSingleLevel in question is the one inside FillPatchTwoLevels_doit. In this case we take the path with smf.size() == 1, so the work is just a ParallelCopy followed by a call to physbcf(). The source MultiFab comes back clean when checked with contains_nan().
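
To illustrate how a clean source can still leave NaNs in the destination, here is a minimal Python sketch (not AMReX code) of a copy that only writes zones where a source box overlaps the destination. The helper names `intersect`, `toy_parallel_copy`, and `contains_nan`, the small box sizes, and the fill value are all hypothetical and chosen only for illustration; the real ParallelCopy operates on distributed MultiFabs.

```python
import math

def intersect(a, b):
    """Intersection of two inclusive 2D boxes ((ilo, jlo), (ihi, jhi)), or None."""
    (alo, ahi), (blo, bhi) = a, b
    lo = (max(alo[0], blo[0]), max(alo[1], blo[1]))
    hi = (min(ahi[0], bhi[0]), min(ahi[1], bhi[1]))
    if lo[0] > hi[0] or lo[1] > hi[1]:
        return None
    return (lo, hi)

def toy_parallel_copy(dst_box, src_boxes, fill_value=1.0):
    """Start the destination as all-NaN (mimicking NaN initialization),
    then overwrite only the zones that some source box overlaps."""
    (ilo, jlo), (ihi, jhi) = dst_box
    dst = {(i, j): math.nan
           for i in range(ilo, ihi + 1)
           for j in range(jlo, jhi + 1)}
    for sb in src_boxes:
        ov = intersect(dst_box, sb)
        if ov is None:
            continue
        (oilo, ojlo), (oihi, ojhi) = ov
        for i in range(oilo, oihi + 1):
            for j in range(ojlo, ojhi + 1):
                dst[(i, j)] = fill_value  # every source value is finite
    return dst

def contains_nan(data):
    """True if any zone still holds a NaN."""
    return any(math.isnan(v) for v in data.values())

dst_box = ((0, 0), (7, 3))
# One source box spans all of j but only i = 0..3 of the destination.
partial = toy_parallel_copy(dst_box, [((0, -2), (3, 5))])
# Adding a second source box that covers i = 4..7 closes the gap.
full = toy_parallel_copy(dst_box, [((0, -2), (3, 5)), ((4, 0), (9, 3))])

print(contains_nan(partial), contains_nan(full))
```

In this toy setup the source data never contains a NaN, yet the destination does whenever the source boxes fail to cover it, which is the pattern reported here.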

@maximumcats
Member Author

Here is the source MF's BoxArray. It contains a box, ((768,13568) (1023,13823) (0,0)), that covers the destination box's full range in j but only part of its range in i (768 to 1023, out of 768 to 1215). No box appears to cover the remainder of the region that needs to be filled.

(BoxArray maxbox(200)
m_ref->m_hash_sig(0)
((0,13440) (255,13567) (0,0)) ((0,13824) (255,13951) (0,0)) ((0,14720) (127,14975) (0,0)) ((0,17920) (255,18175) (0,0)) ((1408,17920) (1663,18175) (0,0)) ((1920,17920) (2047,18047) (0,0)) ((2048,18304) (2303,18431) (0,0)) ((2048,18432) (2175,18559) (0,0)) ((1920,18304) (2047,18559) (0,0)) ((1408,13824) (1663,13951) (0,0)) ((640,13824) (895,13951) (0,0)) ((896,14720) (1151,14975) (0,0)) ((640,17920) (895,18175) (0,0)) ((0,17024) (127,17151) (0,0)) ((0,16256) (127,16511) (0,0)) ((0,15488) (127,15743) (0,0)) ((0,18560) (255,18815) (0,0)) ((896,17024) (1151,17151) (0,0)) ((896,16256) (1151,16511) (0,0)) ((896,15488) (1151,15743) (0,0)) ((640,18560) (895,18815) (0,0)) ((0,14208) (255,14463) (0,0)) ((0,14976) (127,15231) (0,0)) ((0,18176) (255,18303) (0,0)) ((1408,18432) (1663,18687) (0,0)) ((1408,14208) (1663,14463) (0,0)) ((640,14208) (895,14463) (0,0)) ((896,14976) (1151,15231) (0,0)) ((640,18176) (895,18303) (0,0)) ((0,17408) (127,17663) (0,0)) ((0,16512) (127,16767) (0,0)) ((0,15744) (127,15999) (0,0)) ((0,18816) (255,19071) (0,0)) ((896,17408) (1151,17663) (0,0)) ((896,16512) (1151,16767) (0,0)) ((896,15744) (1151,15999) (0,0)) ((640,18816) (895,19071) (0,0)) ((512,13440) (767,13567) (0,0)) ((256,13824) (383,13951) (0,0)) ((384,14720) (639,14975) (0,0)) ((256,17920) (383,18175) (0,0)) ((1664,13824) (1919,13951) (0,0)) ((896,13824) (1151,13951) (0,0)) ((1408,14720) (1663,14975) (0,0)) ((896,17920) (1151,18175) (0,0)) ((384,17024) (639,17151) (0,0)) ((384,16256) (639,16511) (0,0)) ((384,15488) (639,15743) (0,0)) ((256,18560) (383,18815) (0,0)) ((1408,17024) (1663,17151) (0,0)) ((1408,16256) (1663,16511) (0,0)) ((1408,15488) (1663,15743) (0,0)) ((896,18560) (1151,18815) (0,0)) ((256,14208) (383,14463) (0,0)) ((384,14976) (639,15231) (0,0)) ((256,18176) (383,18303) (0,0)) ((1664,14208) (1919,14463) (0,0)) ((896,14208) (1151,14463) (0,0)) ((1408,14976) (1663,15231) (0,0)) ((896,18176) (1151,18303) (0,0)) ((384,17408) (639,17663) (0,0)) ((384,16512) (639,16767) (0,0)) 
((384,15744) (639,15999) (0,0)) ((256,18816) (383,19071) (0,0)) ((1408,17408) (1663,17663) (0,0)) ((1408,16512) (1663,16767) (0,0)) ((1408,15744) (1663,15999) (0,0)) ((896,18816) (1151,19071) (0,0)) ((0,13568) (255,13823) (0,0)) ((0,13952) (255,14207) (0,0)) ((1408,18176) (1663,18431) (0,0)) ((1920,18048) (2047,18303) (0,0)) ((1920,18560) (2047,18815) (0,0)) ((1408,13952) (1663,14207) (0,0)) ((640,13952) (895,14207) (0,0)) ((0,17152) (127,17407) (0,0)) ((896,17152) (1151,17407) (0,0)) ((0,14464) (255,14719) (0,0)) ((0,15232) (127,15487) (0,0)) ((0,18304) (255,18559) (0,0)) ((1408,18688) (1663,18943) (0,0)) ((1408,14464) (1663,14719) (0,0)) ((640,14464) (895,14719) (0,0)) ((896,15232) (1151,15487) (0,0)) ((640,18304) (895,18559) (0,0)) ((0,17664) (127,17919) (0,0)) ((0,16768) (127,17023) (0,0)) ((0,16000) (127,16255) (0,0)) ((0,19072) (255,19327) (0,0)) ((896,17664) (1151,17919) (0,0)) ((896,16768) (1151,17023) (0,0)) ((896,16000) (1151,16255) (0,0)) ((640,19072) (895,19327) (0,0)) ((512,13568) (767,13823) (0,0)) ((256,13952) (383,14207) (0,0)) ((1664,13952) (1919,14207) (0,0)) ((896,13952) (1151,14207) (0,0)) ((384,17152) (639,17407) (0,0)) ((1408,17152) (1663,17407) (0,0)) ((256,14464) (383,14719) (0,0)) ((384,15232) (639,15487) (0,0)) ((256,18304) (383,18559) (0,0)) ((1664,14464) (1919,14719) (0,0)) ((896,14464) (1151,14719) (0,0)) ((1408,15232) (1663,15487) (0,0)) ((896,18304) (1151,18559) (0,0)) ((384,17664) (639,17919) (0,0)) ((384,16768) (639,17023) (0,0)) ((384,16000) (639,16255) (0,0)) ((256,19072) (383,19327) (0,0)) ((1408,17664) (1663,17919) (0,0)) ((1408,16768) (1663,17023) (0,0)) ((1408,16000) (1663,16255) (0,0)) ((896,19072) (1151,19327) (0,0)) ((256,13440) (511,13567) (0,0)) ((128,14720) (383,14975) (0,0)) ((1664,17920) (1919,18175) (0,0)) ((2048,17920) (2303,18047) (0,0)) ((1152,14720) (1407,14975) (0,0)) ((128,17024) (383,17151) (0,0)) ((128,16256) (383,16511) (0,0)) ((128,15488) (383,15743) (0,0)) ((1152,17024) (1407,17151) (0,0)) ((1152,16256) 
(1407,16511) (0,0)) ((1152,15488) (1407,15743) (0,0)) ((128,14976) (383,15231) (0,0)) ((1664,18432) (1919,18687) (0,0)) ((1152,14976) (1407,15231) (0,0)) ((128,17408) (383,17663) (0,0)) ((128,16512) (383,16767) (0,0)) ((128,15744) (383,15999) (0,0)) ((1152,17408) (1407,17663) (0,0)) ((1152,16512) (1407,16767) (0,0)) ((1152,15744) (1407,15999) (0,0)) ((768,13440) (1023,13567) (0,0)) ((384,13824) (639,13951) (0,0)) ((640,14720) (895,14975) (0,0)) ((384,17920) (639,18175) (0,0)) ((1920,13824) (2175,13951) (0,0)) ((1152,13824) (1407,13951) (0,0)) ((1664,14720) (1919,14975) (0,0)) ((1152,17920) (1407,18175) (0,0)) ((640,17024) (895,17151) (0,0)) ((640,16256) (895,16511) (0,0)) ((640,15488) (895,15743) (0,0)) ((384,18560) (639,18815) (0,0)) ((1664,17024) (1919,17151) (0,0)) ((1664,16256) (1919,16511) (0,0)) ((1664,15488) (1919,15743) (0,0)) ((1152,18560) (1407,18815) (0,0)) ((384,14208) (639,14463) (0,0)) ((640,14976) (895,15231) (0,0)) ((384,18176) (639,18303) (0,0)) ((1920,14208) (2175,14463) (0,0)) ((1152,14208) (1407,14463) (0,0)) ((1664,14976) (1919,15231) (0,0)) ((1152,18176) (1407,18303) (0,0)) ((640,17408) (895,17663) (0,0)) ((640,16512) (895,16767) (0,0)) ((640,15744) (895,15999) (0,0)) ((384,18816) (639,19071) (0,0)) ((1664,17408) (1919,17663) (0,0)) ((1664,16512) (1919,16767) (0,0)) ((1664,15744) (1919,15999) (0,0)) ((1152,18816) (1407,19071) (0,0)) ((256,13568) (511,13823) (0,0)) ((1664,18176) (1919,18431) (0,0)) ((2048,18048) (2303,18303) (0,0)) ((128,17152) (383,17407) (0,0)) ((1152,17152) (1407,17407) (0,0)) ((128,15232) (383,15487) (0,0)) ((1664,18688) (1919,18943) (0,0)) ((1152,15232) (1407,15487) (0,0)) ((128,17664) (383,17919) (0,0)) ((128,16768) (383,17023) (0,0)) ((128,16000) (383,16255) (0,0)) ((1152,17664) (1407,17919) (0,0)) ((1152,16768) (1407,17023) (0,0)) ((1152,16000) (1407,16255) (0,0)) ((768,13568) (1023,13823) (0,0)) ((384,13952) (639,14207) (0,0)) ((1920,13952) (2175,14207) (0,0)) ((1152,13952) (1407,14207) (0,0)) ((640,17152) (895,17407) 
(0,0)) ((1664,17152) (1919,17407) (0,0)) ((384,14464) (639,14719) (0,0)) ((640,15232) (895,15487) (0,0)) ((384,18304) (639,18559) (0,0)) ((1920,14464) (2175,14719) (0,0)) ((1152,14464) (1407,14719) (0,0)) ((1664,15232) (1919,15487) (0,0)) ((1152,18304) (1407,18559) (0,0)) ((640,17664) (895,17919) (0,0)) ((640,16768) (895,17023) (0,0)) ((640,16000) (895,16255) (0,0)) ((384,19072) (639,19327) (0,0)) ((1664,17664) (1919,17919) (0,0)) ((1664,16768) (1919,17023) (0,0)) ((1664,16000) (1919,16255) (0,0)) ((1152,19072) (1407,19327) (0,0)) )
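
The coverage argument can be checked with simple box arithmetic. The Python sketch below is a rough illustration, not AMReX code: `intersect`, `subtract`, and `uncovered` are hypothetical helpers, and `src_boxes` holds only a handful of nearby boxes copied from the dump above rather than the full list of 200.

```python
def intersect(a, b):
    """Intersection of two inclusive 2D boxes ((ilo, jlo), (ihi, jhi)), or None."""
    (alo, ahi), (blo, bhi) = a, b
    lo = (max(alo[0], blo[0]), max(alo[1], blo[1]))
    hi = (min(ahi[0], bhi[0]), min(ahi[1], bhi[1]))
    if lo[0] > hi[0] or lo[1] > hi[1]:
        return None
    return (lo, hi)

def subtract(box, cut):
    """Return disjoint inclusive boxes that together cover `box` minus `cut`."""
    ov = intersect(box, cut)
    if ov is None:
        return [box]
    (lo, hi), (olo, ohi) = box, ov
    pieces = []
    # Slabs to the left and right of the overlap in i, spanning the full j range.
    if lo[0] < olo[0]:
        pieces.append(((lo[0], lo[1]), (olo[0] - 1, hi[1])))
    if ohi[0] < hi[0]:
        pieces.append(((ohi[0] + 1, lo[1]), (hi[0], hi[1])))
    # Slabs below and above the overlap in j, limited to the overlap's i range.
    if lo[1] < olo[1]:
        pieces.append(((olo[0], lo[1]), (ohi[0], olo[1] - 1)))
    if ohi[1] < hi[1]:
        pieces.append(((olo[0], ohi[1] + 1), (ohi[0], hi[1])))
    return pieces

def uncovered(dst, srcs):
    """Portion of `dst` not covered by any box in `srcs`."""
    remaining = [dst]
    for s in srcs:
        remaining = [p for r in remaining for p in subtract(r, s)]
    return remaining

# Destination box in which the NaN was reported.
dst_box = ((768, 13696), (1215, 13759))
# A few source boxes near the destination, taken from the dump.
src_boxes = [
    ((512, 13568), (767, 13823)),
    ((768, 13568), (1023, 13823)),
    ((768, 13440), (1023, 13567)),
    ((896, 13824), (1151, 13951)),
]

holes = uncovered(dst_box, src_boxes)
nan_zone = (1024, 13696)
nan_in_hole = any(intersect(h, (nan_zone, nan_zone)) is not None for h in holes)
print(holes, nan_in_hole)
```

With these inputs the only uncovered piece is i = 1024 to 1215 over the full j range of the destination, and the reported NaN zone (1024, 13696) sits at its lower-left corner.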

@maximumcats
Member Author

There is a box in TheFPInfo() on the fine level that is not part of the intersection of the source and destination MultiFabs on the fine level, ((1536,27392) (2431,27519) (0,0)), which corresponds to this box on the coarse level. This looks like a correct result of the calculation based on examining the BoxArrays for mf and fmf: mf has a box ((1536,27392) (2431,27647) (0,0)) which needs to be filled, but the source fmf only has one box covering a part of this range, ((768,27520) (1663,28031) (0,0)).
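
The correspondence between the fine-level box and the coarse-level box can be sketched as below. This assumes a refinement ratio of 2 between the two levels, which is inferred from the index values quoted in this thread rather than stated in it; `coarsen` and `refine` are hypothetical Python helpers for cell-centered, inclusive boxes, not the AMReX functions themselves.

```python
def coarsen(box, ratio):
    """Coarsen an inclusive cell-centered box using floor division."""
    lo, hi = box
    return (tuple(x // ratio for x in lo), tuple(x // ratio for x in hi))

def refine(box, ratio):
    """Refine an inclusive cell-centered box; each coarse zone maps to ratio fine zones per dimension."""
    lo, hi = box
    return (tuple(x * ratio for x in lo), tuple((x + 1) * ratio - 1 for x in hi))

RATIO = 2  # assumption: inferred from the quoted indices

# Fine-level box reported from TheFPInfo().
fine = ((1536, 27392), (2431, 27519))
coarse = coarsen(fine, RATIO)

# Fine zones underneath the coarse zone where the NaN was found.
nan_fine = refine(((1024, 13696), (1024, 13696)), RATIO)

print(coarse, nan_fine)
```

Under that assumption the fine box coarsens to ((768,13696) (1215,13759)), the coarse box in which the NaN was reported, and the NaN zone refines to fine zones inside the unfilled fine box.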

@WeiqunZhang
Member

Can you tell me how to reproduce this? There was a recent change in AMReX to regrid that might have introduced a bug.

@maximumcats
Member Author

maximumcats commented Jul 26, 2020

I've placed the makefile and the checkpoint file in /gpfs/alpine/ast106/world-shared/castro_1147 on Summit. I was using the latest Castro/Microphysics and AMReX 20.07-85-g92418c3a0. It should be reproducible on four nodes with jsrun -n 128 -r 32 -c 1 -a 1 -X 1 -brs Castro2d.gnu.DEBUG.TPROF.MPI.ex inputs amr.restart=chk01612 amrex.fpe_trap_invalid=1.

@WeiqunZhang
Member

Is this wdmerger?

@maximumcats
Member Author

Yes, that's right. The makefile will build from wdmerger if you have CASTRO_HOME set.

@WeiqunZhang
Member

AMReX-Codes/amrex#1204 @maxpkatz Can you try that PR?

@maximumcats
Member Author

That looks like it resolves it, thank you.
