
CI: Trap NaNs, Divisions by Zero, Overflows #2205

Merged: 1 commit merged into ECP-WarpX:development from the CI_trap branch on Aug 27, 2021

Conversation

@EZoni (Member) commented Aug 18, 2021

Trap NaNs, divisions by zero, and overflows when running our CI tests.
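For context, here is a minimal standalone sketch (not WarpX code; it assumes Linux/glibc, where feenableexcept() is available as a GNU extension in <fenv.h>) of the kind of floating-point trapping that the amrex.fpe_trap_* runtime parameters turn on:

// Minimal sketch, not WarpX code: raise SIGFPE instead of silently
// producing NaN/Inf on invalid operations (e.g. sqrt of a negative
// number), divisions by zero, and overflows.
#include <cmath>
#include <cstdio>
#include <fenv.h>   // feenableexcept() is a glibc/GNU extension

int main()
{
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);

    volatile double T = -1.0;    // e.g. a negative "temperature"
    double bad = std::sqrt(T);   // traps here under FE_INVALID
    std::printf("%f\n", bad);    // not reached when trapping is enabled
    return 0;
}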

@EZoni EZoni added the component: tests Tests and CI label Aug 18, 2021
@EZoni EZoni changed the title CI: Trap NaNs, Divisions by Zero, Overflows [WIP] CI: Trap NaNs, Divisions by Zero, Overflows Aug 18, 2021
@EZoni (Member, Author) commented Aug 18, 2021

CI tests that crashed or failed:

@EZoni (Member, Author) commented Aug 18, 2021

Regarding the CI test collisionXZ, I ran it locally, on this branch, with run_test.sh, in DEBUG mode, and found that we try to compute the square root of a negative number here, because both T1t and T2t are equal to -1.0:

lmdD = T_R(1.0)/std::sqrt( n1*q1*q1/(T1t*PhysConst::ep0) +
                           n2*q2*q2/(T2t*PhysConst::ep0) );

I think the same is happening with the CI test collisionXYZ.

@RemiLehe @Yin-YinjianZhao
Does this look like a bug to you, too? Would you be able to understand why this happens and fix it, or shall I try to look into this?

Update: bug fix in #2225.
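For reference, one possible guard (a sketch only, not necessarily what the fix in #2225 does; inv_lmdD_sq is a hypothetical helper variable) would be to skip the contribution of any species whose temperature is non-positive, so that std::sqrt never receives a negative argument:

// Hypothetical guard, not the actual #2225 fix: accumulate only the
// contributions with a strictly positive temperature.
auto inv_lmdD_sq = T_R(0.0);
if (T1t > T_R(0.0)) { inv_lmdD_sq += n1*q1*q1/(T1t*PhysConst::ep0); }
if (T2t > T_R(0.0)) { inv_lmdD_sq += n2*q2*q2/(T2t*PhysConst::ep0); }
lmdD = (inv_lmdD_sq > T_R(0.0)) ? T_R(1.0)/std::sqrt(inv_lmdD_sq) : T_R(0.0);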

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI test momentum-conserving-gather, there are some NaNs in the arrays Bxg, Byg, Bzg, Exg, Eyg, Ezg used in the kernel Source/Parallelization/WarpXComm_K.H, so the crash is caused by activating amrex.fpe_trap_invalid=1. The NaNs can be observed by running this test locally, on this branch, with run_test.sh, in DEBUG mode, but setting amrex.fpe_trap_invalid=0 (that is, overriding this particular change of this PR, since otherwise the code crashes without printing anything).

For example, looking at Bxg, it is passed as bx_c here

warpx_interp_nd_bfield_x(j,k,l, bx_aux, bx_fp, bx_cp, bx_c);

and bx_c is computed from Btmp here
Array4<Real const> const& bx_c = Btmp[0]->const_array(mfi);

which, in turn, is computed here
// ParallelCopy from coarse level
for (int i = 0; i < 3; ++i) {
    IntVect ng = Btmp[i]->nGrowVect();
    // Guard cells may not be up to date beyond ng_FieldGather
    const amrex::IntVect& ng_src = guard_cells.ng_FieldGather;
    // Copy Bfield_aux to Btmp, using up to ng_src (=ng_FieldGather) guard cells from
    // Bfield_aux and filling up to ng (=nGrow) guard cells in Btmp
    Btmp[i]->ParallelCopy(*Bfield_aux[lev-1][i], 0, 0, 1, ng_src, ng, cperiod);
}

So I think these NaNs have something to do with the recent changes in #2144. I can confirm that the NaNs disappear if we revert the changes in #2144 (namely if we replace ng_src with ng everywhere). What do you think, @NeilZaim?
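As a side note, when running with amrex.fpe_trap_invalid=0 one can also check the temporaries directly instead of waiting for the kernel to crash, e.g. with MultiFab::contains_nan (a hypothetical debugging snippet, not part of this PR):

// Hypothetical debugging aid: report whether the temporaries hold NaNs,
// checking component 0 including all allocated guard cells.
for (int i = 0; i < 3; ++i) {
    if (Btmp[i]->contains_nan(0, 1, Btmp[i]->nGrow())) {
        amrex::Print() << "Btmp[" << i << "] contains NaNs\n";
    }
}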

@NeilZaim (Member) commented

@EZoni Thanks for adding this PR, that's a great idea!
I could reproduce your issue regarding #2144. Maybe this PR was indeed a bit premature, since, as you observed, we can have some uninitialized values if we don't use all the guard cells in the ParallelCopy (I don't think it should affect the regular cells too much, and in fact #2144 didn't break CI, but it's still dangerous to leave uninitialized values hanging around).
I don't think we should simply revert ng_src to ng either, because then non-updated guard cells can affect regular cells when the calls to FillBoundary are done with ng_FieldGather guard cells (the initial issue that prompted #2144).
In fact, it looks like calling FillBoundary sometimes with ng_alloc_EB guard cells and sometimes with ng_FieldGather guard cells can be a bit tricky, precisely because of these types of issues. So I'm thinking that maybe, for now, we should always call FillBoundary with ng_alloc_EB guard cells, to ensure we have correct code, and then iterate from there. Initially I did not want to do that because I was afraid of exchanging all the guard cells twice per timestep with PSATD, but in fact with PSATD we somehow use ng_FieldGather = ng_alloc_EB, meaning that we are already exchanging all the guard cells twice per timestep (I'll try to improve that afterwards, because this should significantly slow down the simulations in 3D).
On the other hand, using ng_alloc_EB everywhere would result in a small performance penalty with FDTD (more guard-cell exchanges), but I feel like starting from a code that works in all situations would make improving on this much simpler and be worth it in the long run. What do you think about this? Do you see another simple solution? If what I proposed makes sense to you, I can do this in a quick PR.
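In code terms, the proposal would amount to something like the following rough sketch at the MultiFab level (the actual WarpX call sites and field containers may differ):

// Rough sketch, not the actual WarpX call sites: exchange the full
// allocated guard region instead of only ng_FieldGather.
const amrex::IntVect& ng_full = guard_cells.ng_alloc_EB;
for (int i = 0; i < 3; ++i) {
    Bfield_fp[lev][i]->FillBoundary(ng_full, Geom(lev).periodicity());
}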

@EZoni (Member, Author) commented Aug 19, 2021

@NeilZaim Thank you for your comments. Your solution seems like a logical and robust one, even though it might come at a small performance price, as you say. One alternative, since as you noted this likely does not affect the valid cells, is simply to make sure that the Btmp and Etmp MultiFabs are initialized (with setVal(0.0)), after they are declared and created, and before ParallelCopy is called. This also removes the residual NaNs (personally I feel like every MultiFab should be automatically initialized to zero as soon as it is created, but this is a comment for AMReX). I guess my solution is not as rigorous as yours, though: what do you think?
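A sketch of that alternative (hypothetical placement, not necessarily the change that was eventually merged):

// Hypothetical sketch of the setVal(0.0) alternative: zero-initialize the
// temporaries right after they are created, so that guard cells not covered
// by the ParallelCopy (beyond ng_src) never hold uninitialized values.
for (int i = 0; i < 3; ++i) {
    Btmp[i]->setVal(0.0);
    Etmp[i]->setVal(0.0);
}
// ... the existing ParallelCopy with ng_src guard cells then follows unchanged.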

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI test LaserIonAcc2d, the value y passed to the function fun_filterparser here

if (!fun_filterparser(t, x, y, z, ux, uy, uz))

and computed shortly above with GetPosition(i, x, y, z), seems to be NaN already for i = 0, right when these diagnostics are called at initialization, within WarpX::InitData.

@ax3l @atmyers I think it might have something to do with what happens inside PhysicalParticleContainer::InitData, PhysicalParticleContainer::AddParticles, and ultimately PhysicalParticleContainer::AddPlasma. Would you have an idea of why some y values for some particles in this 2D setup could be uninitialized and thus NaN?

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI tests ionization_lab and ionization_boost, there is a division by zero here:

amrex::Real w_dtau = 1._rt/ ga * m_adk_prefactor[ion_lev] *
    std::pow(E, m_adk_power[ion_lev]) *
    std::exp( m_adk_exp_prefactor[ion_lev]/E );

Does anybody know how this formula should be modified when E = 0 (see the exponential on the last line)?

Update: bug fix in #2214.
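One natural guard (a sketch only, not necessarily how #2214 resolves it) is to note that a vanishing field should give a vanishing ionization rate, and to skip the formula when E = 0:

// Hypothetical guard, not the actual #2214 fix: a vanishing field gives a
// vanishing ionization rate, so skip the ADK formula (and the 1/E inside
// the exponential) when E is zero.
amrex::Real w_dtau = 0._rt;
if (E > 0._rt) {
    w_dtau = 1._rt/ ga * m_adk_prefactor[ion_lev] *
             std::pow(E, m_adk_power[ion_lev]) *
             std::exp( m_adk_exp_prefactor[ion_lev]/E );
}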

@ax3l (Member) commented Aug 19, 2021

Awesome, love it! 🤩

LaserIonAcc2d [...] and computed shortly above as GetPosition(i, x, y, z); seems to be NaN, already for i = 0 and right when these diags are called at initialization, within WarpX::InitData.

I think that's a general 2D issue with filters: in GetPosition we intentionally set y to m_snan, and we might need to be careful not to compare against it in the filter parser. The filters in that example do not access y

histuH.filter_function(t,x,y,z,ux,uy,uz) = "abs(acos(uz / sqrt(ux*ux+uy*uy+uz*uz))) <= 0.017453"

but maybe the parser still touches/copies the scalar y value during some transformations? cc @WeiqunZhang

@WeiqunZhang (Member) commented

Yes, in Parser's operator(), all arguments are copied into an internal array. That will touch y.

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI tests embedded_boundary_cube and particle_absorption, the arrays Sx, Sy and Sz defined here

amrex::Array4<amrex::Real> const& Sx = face_areas[0]->array(mfi);
amrex::Array4<amrex::Real> const& Sy = face_areas[1]->array(mfi);
amrex::Array4<amrex::Real> const& Sz = face_areas[2]->array(mfi);

have NaNs in certain slices. For example, for embedded_boundary_cube I see Sx(48,:,:) = nan, Sy(:,48,:) = nan and Sz(:,:,48) = nan, and for particle_absorption I see Sx(64,:,:) = nan, Sy(:,64,:) = nan and Sz(:,:,64) = nan.

@lgiacome Would you be able to understand why this happens (namely why those last slices in the arrays inside m_face_areas are uninitialized and thus NaN) and fix it?

Some more hints:
I guess these m_face_areas are initialized in WarpX::ComputeFaceAreas, and I can confirm that the loop here

amrex::LoopOnCpu(amrex::convert(box, amrex::Box(face).ixType()),
                 [=](int i, int j, int k) {
                     face_areas_dim(i, j, k) = face(i, j, k);
                 });

is missing the last points that correspond to the uninitialized slices above. Is it possible that amrex::convert(box, amrex::Box(face).ixType()) is not the right box (with the right index type) that we want to use here?
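As a simple stopgap (hypothetical, not the actual fix, which lives in the ECT-solver PR discussed below), the face-area MultiFabs could be zero-initialized before the loop fills them, so that any slice the loop misses holds 0 instead of NaN:

// Hypothetical stopgap, not the actual fix: make every entry of the
// face-area MultiFabs well defined before the loop fills the covered faces.
for (int idim = 0; idim < 3; ++idim) {
    face_areas[idim]->setVal(0.0);
}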

@Yin-YinjianZhao (Member) commented Aug 20, 2021

Regarding the CI test collisionXZ, I ran it locally, on this branch, with run_test.sh, in DEBUG mode, and found that we try to compute the square root of a negative number here, because both T1t and T2t are equal to -1.0:

lmdD = T_R(1.0)/std::sqrt( n1*q1*q1/(T1t*PhysConst::ep0) +
                           n2*q2*q2/(T2t*PhysConst::ep0) );

I think the same is happening with the CI test collisionXYZ.
@RemiLehe @Yin-YinjianZhao
Does this look like a bug to you, too? Would you be able to understand why this happens and fix it, or shall I try to look into this?

@EZoni Thanks for pointing this out. I remember we set them to -1 on purpose before. But apparently that is not good, even if it does not affect the algorithm later on. I will try to modify the code to avoid the NaNs.

@lgiacome (Member) commented

Hi @EZoni!
I was aware of the problem of the NaNs in the face areas at the domain boundaries. I included a fix that also initialises these entries in the PR for the ECT solver. Probably I should have done a separate PR...

@EZoni (Member, Author) commented Aug 24, 2021

Thank you, @lgiacome! I just tested the PR for the ECT solver for this particular issue, by running the CI tests embedded_boundary_cube and particle_absorption with the runtime option amrex.fpe_trap_invalid=1, and the test embedded_boundary_cube still seems to fail. It could be that some old NaNs are still there, or that there are new NaNs somewhere else. In order to tackle one problem at a time, would it be possible for you to open a mini PR just to address this particular issue here, so that we can test it and merge it separately? Let me know if I can help and/or share some work.

@ax3l (Member) commented Aug 25, 2021

I am working on two fixes for LaserIonAcc2d:

@EZoni EZoni changed the title [WIP] CI: Trap NaNs, Divisions by Zero, Overflows CI: Trap NaNs, Divisions by Zero, Overflows Aug 25, 2021
@EZoni EZoni marked this pull request as ready for review August 25, 2021 23:27
@ax3l (Member) left a comment

Thank you for proposing and triaging this, that's fantastic! 🎉
Thank you to everyone involved in the rapid bug fixes 🚀 ✨

@ax3l ax3l self-assigned this Aug 27, 2021
@ax3l ax3l enabled auto-merge (squash) August 27, 2021 22:26
@ax3l ax3l merged commit 9a4f5bd into ECP-WarpX:development Aug 27, 2021
@EZoni EZoni deleted the CI_trap branch October 1, 2021 17:19