
CI: Trap NaNs, Divisions by Zero, Overflows #2205

Merged: 1 commit merged into ECP-WarpX:development from the CI_trap branch on Aug 27, 2021

Conversation

@EZoni (Member) commented Aug 18, 2021

Trap NaNs, divisions by zero, and overflows when running our CI tests.
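For context, here is a minimal standalone sketch (not WarpX code; it assumes Linux/glibc, where feenableexcept() is available as a GNU extension in <fenv.h>) of the kind of floating-point trapping that the amrex.fpe_trap_* runtime parameters turn on:

// Minimal sketch, not WarpX code: raise SIGFPE instead of silently
// producing NaN/Inf on invalid operations (e.g. sqrt of a negative
// number), divisions by zero, and overflows.
#include <cmath>
#include <cstdio>
#include <fenv.h>   // feenableexcept() is a glibc/GNU extension

int main()
{
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);

    volatile double T = -1.0;    // e.g. a negative "temperature"
    double bad = std::sqrt(T);   // traps here under FE_INVALID
    std::printf("%f\n", bad);    // not reached when trapping is enabled
    return 0;
}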

@EZoni EZoni added the component: tests Tests and CI label Aug 18, 2021
@EZoni EZoni changed the title CI: Trap NaNs, Divisions by Zero, Overflows [WIP] CI: Trap NaNs, Divisions by Zero, Overflows Aug 18, 2021
@EZoni (Member, Author) commented Aug 18, 2021

CI tests that crashed or failed:

@EZoni (Member, Author) commented Aug 18, 2021

Regarding the CI test collisionXZ, I ran it locally, on this branch, with run_test.sh, in DEBUG mode, and found that we try to compute the square root of a negative number here, because both T1t and T2t are equal to -1.0:

lmdD = T_R(1.0)/std::sqrt( n1*q1*q1/(T1t*PhysConst::ep0) +
                           n2*q2*q2/(T2t*PhysConst::ep0) );

I think the same is happening with the CI test collisionXYZ.

@RemiLehe @Yin-YinjianZhao
Does this look like a bug to you, too? Would you be able to understand why this happens and fix it, or shall I try to look into this?

Update: bug fix in #2225.
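For reference, one possible guard (a sketch only, not necessarily what the fix in #2225 does; inv_lmdD_sq is a hypothetical helper variable) would be to skip the contribution of any species whose temperature is non-positive, so that std::sqrt never receives a negative argument:

// Hypothetical guard, not the actual #2225 fix: accumulate only the
// contributions with a strictly positive temperature.
auto inv_lmdD_sq = T_R(0.0);
if (T1t > T_R(0.0)) { inv_lmdD_sq += n1*q1*q1/(T1t*PhysConst::ep0); }
if (T2t > T_R(0.0)) { inv_lmdD_sq += n2*q2*q2/(T2t*PhysConst::ep0); }
lmdD = (inv_lmdD_sq > T_R(0.0)) ? T_R(1.0)/std::sqrt(inv_lmdD_sq) : T_R(0.0);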

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI test momentum-conserving-gather, there are some NaNs in the arrays Bxg, Byg, Bzg, Exg, Eyg, Ezg used in the kernel Source/Parallelization/WarpXComm_K.H, so the crash is caused by activating amrex.fpe_trap_invalid=1. The NaNs can be observed by running this test locally, on this branch, with run_test.sh, in DEBUG mode, but setting amrex.fpe_trap_invalid=0 (that is, overriding this particular change of this PR, since otherwise the code crashes without printing anything).

For example, looking at Bxg, it is passed as bx_c here

warpx_interp_nd_bfield_x(j,k,l, bx_aux, bx_fp, bx_cp, bx_c);

and bx_c is computed from Btmp here
Array4<Real const> const& bx_c = Btmp[0]->const_array(mfi);

which, in turn, is computed here
// ParallelCopy from coarse level
for (int i = 0; i < 3; ++i) {
    IntVect ng = Btmp[i]->nGrowVect();
    // Guard cells may not be up to date beyond ng_FieldGather
    const amrex::IntVect& ng_src = guard_cells.ng_FieldGather;
    // Copy Bfield_aux to Btmp, using up to ng_src (=ng_FieldGather) guard cells from
    // Bfield_aux and filling up to ng (=nGrow) guard cells in Btmp
    Btmp[i]->ParallelCopy(*Bfield_aux[lev-1][i], 0, 0, 1, ng_src, ng, cperiod);
}

So I think these NaNs have something to do with the recent changes in #2144. I can confirm that the NaNs disappear if we revert the changes in #2144 (namely if we replace ng_src with ng everywhere). What do you think, @NeilZaim?
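As a side note, when running with amrex.fpe_trap_invalid=0 one can also check the temporaries directly instead of waiting for the kernel to crash, e.g. with MultiFab::contains_nan (a hypothetical debugging snippet, not part of this PR):

// Hypothetical debugging aid: report whether the temporaries hold NaNs,
// checking component 0 including all allocated guard cells.
for (int i = 0; i < 3; ++i) {
    if (Btmp[i]->contains_nan(0, 1, Btmp[i]->nGrow())) {
        amrex::Print() << "Btmp[" << i << "] contains NaNs\n";
    }
}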

@NeilZaim (Member) commented

@EZoni Thanks for adding this PR, that's a great idea!
I could reproduce your issue regarding #2144. Maybe this PR was indeed a bit premature, since, as you observed, we can have some uninitialized values if we don't use all the guard cells in the ParallelCopy (I don't think it should affect the regular cells too much, and in fact #2144 didn't break CI, but it's still dangerous to leave uninitialized values hanging around).
I don't think we should simply revert ng_src to ng either, because then non-updated guard cells can affect regular cells when the calls to FillBoundary are done with ng_FieldGather guard cells (the initial issue that prompted #2144).
In fact, it looks like calling FillBoundary sometimes with ng_alloc_EB guard cells and sometimes with ng_FieldGather guard cells can be a bit tricky, precisely because of these types of issues. So I'm thinking that maybe, for now, we should always call FillBoundary with ng_alloc_EB guard cells, to ensure we have correct code, and then iterate from there. Initially I did not want to do that because I was afraid of exchanging all the guard cells twice per timestep with PSATD, but in fact with PSATD we somehow use ng_FieldGather = ng_alloc_EB, meaning that we are already exchanging all the guard cells twice per timestep (I'll try to improve that afterwards, because this should significantly slow down the simulations in 3D).
On the other hand, using ng_alloc_EB everywhere would result in a small performance penalty with FDTD (more guard-cell exchanges), but I feel like starting from a code that works in all situations would make improving on this much simpler and be worth it in the long run. What do you think about this? Do you see another simple solution? If what I proposed makes sense to you, I can do this in a quick PR.
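In code terms, the proposal would amount to something like the following rough sketch at the MultiFab level (the actual WarpX call sites and field containers may differ):

// Rough sketch, not the actual WarpX call sites: exchange the full
// allocated guard region instead of only ng_FieldGather.
const amrex::IntVect& ng_full = guard_cells.ng_alloc_EB;
for (int i = 0; i < 3; ++i) {
    Bfield_fp[lev][i]->FillBoundary(ng_full, Geom(lev).periodicity());
}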

@EZoni (Member, Author) commented Aug 19, 2021

@NeilZaim Thank you for your comments. Your solution seems like a logical and robust one, even though it might come at a small performance price, as you say. One alternative, since as you noted this likely does not affect the valid cells, is simply to make sure that the Btmp and Etmp MultiFabs are initialized (with setVal(0.0)), after they are declared and created, and before ParallelCopy is called. This also removes the residual NaNs (personally I feel like every MultiFab should be automatically initialized to zero as soon as it is created, but this is a comment for AMReX). I guess my solution is not as rigorous as yours, though: what do you think?
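A sketch of that alternative (hypothetical placement, not necessarily the change that was eventually merged):

// Hypothetical sketch of the setVal(0.0) alternative: zero-initialize the
// temporaries right after they are created, so that guard cells not covered
// by the ParallelCopy (beyond ng_src) never hold uninitialized values.
for (int i = 0; i < 3; ++i) {
    Btmp[i]->setVal(0.0);
    Etmp[i]->setVal(0.0);
}
// ... the existing ParallelCopy with ng_src guard cells then follows unchanged.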

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI test LaserIonAcc2d, the value y passed to the function fun_filterparser here

if (!fun_filterparser(t, x, y, z, ux, uy, uz))

and computed shortly above with GetPosition(i, x, y, z), seems to be NaN already for i = 0, right when these diagnostics are called at initialization, within WarpX::InitData.

@ax3l @atmyers I think it might have something to do with what happens inside PhysicalParticleContainer::InitData, PhysicalParticleContainer::AddParticles, and ultimately PhysicalParticleContainer::AddPlasma. Would you have an idea of why some y values for some particles in this 2D setup could be uninitialized and thus NaN?

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI tests ionization_lab and ionization_boost, there is a division by zero here:

amrex::Real w_dtau = 1._rt/ ga * m_adk_prefactor[ion_lev] *
    std::pow(E, m_adk_power[ion_lev]) *
    std::exp( m_adk_exp_prefactor[ion_lev]/E );

Does anybody know how this formula should be modified when E = 0 (see the exponential on the last line)?

Update: bug fix in #2214.
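One natural guard (a sketch only, not necessarily how #2214 resolves it) is to note that a vanishing field should give a vanishing ionization rate, and to skip the formula when E = 0:

// Hypothetical guard, not the actual #2214 fix: a vanishing field gives a
// vanishing ionization rate, so skip the ADK formula (and the 1/E inside
// the exponential) when E is zero.
amrex::Real w_dtau = 0._rt;
if (E > 0._rt) {
    w_dtau = 1._rt/ ga * m_adk_prefactor[ion_lev] *
             std::pow(E, m_adk_power[ion_lev]) *
             std::exp( m_adk_exp_prefactor[ion_lev]/E );
}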

@ax3l (Member) commented Aug 19, 2021

Awesome, love it! 🤩

LaserIonAcc2d [...] and computed shortly above as GetPosition(i, x, y, z); seems to be NaN, already for i = 0 and right when these diags are called at initialization, within WarpX::InitData.

I think that's a general 2D issue with filters: in GetPosition we intentionally set y to m_snan, and we might need to be careful not to compare against it in the filter parser. The filters in that example do not access y

histuH.filter_function(t,x,y,z,ux,uy,uz) = "abs(acos(uz / sqrt(ux*ux+uy*uy+uz*uz))) <= 0.017453"

but maybe the parser still touches/copies the scalar y value during some transformations? cc @WeiqunZhang

@WeiqunZhang (Member) commented

Yes, in Parser's operator(), all arguments are copied into an internal array. That will touch y.

@EZoni (Member, Author) commented Aug 19, 2021

Regarding the CI tests embedded_boundary_cube and particle_absorption, the arrays Sx, Sy and Sz defined here

amrex::Array4<amrex::Real> const& Sx = face_areas[0]->array(mfi);
amrex::Array4<amrex::Real> const& Sy = face_areas[1]->array(mfi);
amrex::Array4<amrex::Real> const& Sz = face_areas[2]->array(mfi);

have NaNs in certain slices. For example, for embedded_boundary_cube I see Sx(48,:,:) = nan, Sy(:,48,:) = nan and Sz(:,:,48) = nan, and for particle_absorption I see Sx(64,:,:) = nan, Sy(:,64,:) = nan and Sz(:,:,64) = nan.

@lgiacome Would you be able to understand why this happens (namely why those last slices in the arrays inside m_face_areas are uninitialized and thus NaN) and fix it?

Some more hints:
I guess these m_face_areas are initialized in WarpX::ComputeFaceAreas, and I can confirm that the loop here

amrex::LoopOnCpu(amrex::convert(box, amrex::Box(face).ixType()),
                 [=](int i, int j, int k) {
                     face_areas_dim(i, j, k) = face(i, j, k);
                 });

is missing the last points that correspond to the uninitialized slices above. Is it possible that amrex::convert(box, amrex::Box(face).ixType()) is not the right box (with the right index type) that we want to use here?
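As a simple stopgap (hypothetical, not the actual fix, which lives in the ECT-solver PR discussed below), the face-area MultiFabs could be zero-initialized before the loop fills them, so that any slice the loop misses holds 0 instead of NaN:

// Hypothetical stopgap, not the actual fix: make every entry of the
// face-area MultiFabs well defined before the loop fills the covered faces.
for (int idim = 0; idim < 3; ++idim) {
    face_areas[idim]->setVal(0.0);
}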

@Yin-YinjianZhao (Member) commented Aug 20, 2021

Regarding the CI test collisionXZ, I ran it locally, on this branch, with run_test.sh, in DEBUG mode, and found that we try to compute the square root of a negative number here, because both T1t and T2t are equal to -1.0:

lmdD = T_R(1.0)/std::sqrt( n1*q1*q1/(T1t*PhysConst::ep0) +
                           n2*q2*q2/(T2t*PhysConst::ep0) );

I think the same is happening with the CI test collisionXYZ.
@RemiLehe @Yin-YinjianZhao
Does this look like a bug to you, too? Would you be able to understand why this happens and fix it, or shall I try to look into this?

@EZoni Thanks for pointing this out. I remember we set them to -1 on purpose before. But apparently that is not good, even if it does not affect the algorithm later on. I will try to modify the code to avoid the NaNs.

@lgiacome (Member) commented

Hi @EZoni!
I was aware of the problem of the NaNs in the face areas at the domain boundaries. I included a fix that also initialises these entries in the PR for the ECT solver. Probably I should have done a separate PR...

@EZoni (Member, Author) commented Aug 24, 2021

Thank you, @lgiacome! I just tested the PR for the ECT solver for this particular issue, by running the CI tests embedded_boundary_cube and particle_absorption with the runtime option amrex.fpe_trap_invalid=1, and the test embedded_boundary_cube still seems to fail. It could be that some old NaNs are still there, or that there are new NaNs somewhere else. In order to tackle one problem at a time, would it be possible for you to open a mini PR just to address this particular issue here, so that we can test it and merge it separately? Let me know if I can help and/or share some work.

@ax3l (Member) commented Aug 25, 2021

I am working on two fixes for LaserIonAcc2d:

@EZoni EZoni changed the title [WIP] CI: Trap NaNs, Divisions by Zero, Overflows CI: Trap NaNs, Divisions by Zero, Overflows Aug 25, 2021
@EZoni EZoni marked this pull request as ready for review August 25, 2021 23:27
@ax3l (Member) left a comment

Thank you for proposing and triaging this, that's fantastic! 🎉
Thank you to everyone involved in the rapid bug fixes 🚀 ✨

@ax3l ax3l self-assigned this Aug 27, 2021
@ax3l ax3l enabled auto-merge (squash) August 27, 2021 22:26
@ax3l ax3l merged commit 9a4f5bd into ECP-WarpX:development Aug 27, 2021
@EZoni EZoni deleted the CI_trap branch October 1, 2021 17:19