
Model crash, negative dvice #2562

Closed
benjamin-cash opened this issue Jan 19, 2025 · 128 comments · May be fixed by #2625

Comments

@benjamin-cash

I am running the SFS configuration of the model at C192mx025, with the global-workflow and ufs-weather-model hashes as given here.

For the most part the runs have been running stably, but I have seen a significant number of crashes with error messages like the following:

PASS: fcstRUN phase 1, n_atmsteps =            10824 time is         2.919200

  (shift_ice)shift_ice: negative dvice
  (shift_ice)boundary, donor cat:           3           4
  (shift_ice)daice =  0.000000000000000E+000
  (shift_ice)dvice = -2.944524469334827E-065
    (icepack_warnings_setabort) T :file icepack_itd.F90 :line          551
 (shift_ice) shift_ice: negative dvice
 (icepack_warnings_aborted) ... (shift_ice)
 (icepack_warnings_aborted) ... (linear_itd)
 (icepack_warnings_aborted) ... (icepack_step_therm2)
 (icepack_warnings_aborted) ... (icepack_step_therm2)

@ShanSunNOAA - is this the same issue you have been seeing?

@ShanSunNOAA
Collaborator

ShanSunNOAA commented Jan 19, 2025 via email

@NickSzapiro-NOAA
Collaborator

Maybe a first thing to check is whether the CICE initial conditions for the crash cases are odd from the start. As mentioned in CICE-Consortium/Icepack#333, all IC ice thicknesses (vicen/aicen in each category) should be within the category bounds hin_max. This check can be done offline or during initialization.

I'm not sure of the origins of these ICs, and there is a permissions issue with the Google Sheets link, but maybe the ICs or run directories are available somewhere?
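
For reference, a minimal offline sketch of that thickness check using nco could look like the following. This assumes the restart uses the standard CICE variable names aicen/vicen on dimensions (ncat, nj, ni); the printed per-category maxima should then be compared against the hin_max bounds implied by the run's ncat/kcatbound settings, which are not hard-coded here.

src=iced.2012-05-01-10800.nc

# per-category thickness hicen = vicen/aicen where ice is present, else 0
ncap2 -O -s 'hicen=0.0*vicen; where(aicen > 1.0e-11) hicen=vicen/aicen;' ${src} hicen.nc

# reduce over the spatial dimensions to get the maximum thickness per category,
# then compare the printed values against the run's hin_max boundaries
ncwa -O -y max -a nj,ni -v hicen hicen.nc hicen_max.nc
ncdump hicen_max.nc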

@benjamin-cash
Author

Hi @NickSzapiro-NOAA - thanks for pointing out that issue. I will definitely do that check, although on first glance there are some differences with that issue. This is not occurring rarely in my case - at last count I had 47 failures of this kind out of 231 runs. Some of the crashes are coming 2+ months into the simulation, which also seems a bit odd for an IC bug. Each ensemble member is also using the same ice initial file, and not all ensemble members are crashing.

Having said all that, here is a link to one of the ice initial files on AWS that is associated with a crash:
https://noaa-ufs-gefsv13replay-pds.s3.amazonaws.com/2012/05/2012050106/iced.2012-05-01-10800.nc

The runs are being performed on Frontera, which I could either give you access to or I could transfer a run directory offsite.

@NeilBarton-NOAA - I remember you saying there was an issue with sea ice ICs in the past that you had developed a workaround for, but looking at the code it seems like that was related to the ice edge and not the thickness bounds Nick mentioned here.

@NickSzapiro-NOAA
Collaborator

I don't see any problems with the thickness categories in that IC file.

These negative-dvice aborts at around -1e-65 really do seem vanishingly small, particularly relative to a_min/m_min/hi_min and the zap_small_areas thresholds in Icepack. One test is to change the check so the transfer only needs to exceed -puny, rather than -puny times the donor amount (as Dave Bailey proposed in CICE-Consortium/Icepack#338).

I also wonder how much residual ice (CICE-Consortium/CICE#645) is just present in these runs

Before really looking into cases or modifications, maybe @DeniseWorthen and @NeilBarton-NOAA have thoughts

@benjamin-cash
Author

Hi @NickSzapiro-NOAA - some more context for this problem is that (to my knowledge) it does not appear in the case where the atmosphere and ocean are reduced to 1 degree (C96mx100). So far we are only seeing it in the C192mx025 runs, where we are using those IC files I pointed you to directly.

One thing I have not yet done is any kind of analysis of the ice in those runs, to see if there is something pathological going on. That's first up on my agenda for today. @ShanSunNOAA do you have any insights from your crashes?

@ShanSunNOAA
Collaborator

I was testing the addition of the -ftz flag (flush-to-zero), but Denise pointed out that it was already in place. Why didn’t the flag work as expected and set the e-67 value to zero?

@benjamin-cash
Author

@NickSzapiro-NOAA - If there was a diagnostic field that would give some clues as to what might be going on here, do you have a sense for what it would be? For example, I see a lot of very small aice_h values (e.g., 8.168064e-11, 7.310505e-14), but I'm not familiar enough with CICE to know what to make of these.

@NickSzapiro-NOAA
Collaborator

I would say that very sparse ice is more technical than physical as it may not be moving, melting, or freezing depending on some threshold dynamics and thermodynamics settings (see CICE-Consortium/CICE#645).

It seems reasonable that the sparse areas are associated with these crashes but we can confirm. I'm happy to try to log more information about what's happening...I don't know if you would need to provide a run directory or if this is reproducible via global-workflow or such on RDHPCS.
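
As a rough way to quantify that, a small nco sketch along these lines could count the "residual ice" cells in a restart (assuming, as above, the aicen variable name and Icepack's puny = 1.0e-11; the file names are placeholders):

# count category cells holding residual ice: 0 < aicen < puny
ncap2 -O -s 'sparse=(aicen > 0.0 && aicen < 1.0e-11); nsparse=sparse.total();' iced.2012-05-01-10800.nc sparse.nc
ncdump -v nsparse sparse.nc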

If you're open to experimenting, the first "quick fix" that comes to mind is editing ufs-weather-model/CICE-interface/CICE/icepack/columnphysics/icepack_itd.F90

diff --git a/columnphysics/icepack_itd.F90 b/columnphysics/icepack_itd.F90
index 013373a..32debc2 100644
--- a/columnphysics/icepack_itd.F90
+++ b/columnphysics/icepack_itd.F90
@@ -462,7 +462,7 @@ subroutine shift_ice (trcr_depend,           &
                nd = donor(n)

                if (daice(n) < c0) then
-                  if (daice(n) > -puny*aicen(nd)) then
+                  if (daice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0
                   else
@@ -471,7 +471,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (dvice(n) < c0) then
-                  if (dvice(n) > -puny*vicen(nd)) then
+                  if (dvice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0

The reasoning is that this is more consistent with what is cleared by the zap_small_areas routine in cleanup_itd.

@benjamin-cash
Author

My runs are using global-workflow but on Frontera via containers, so that might not be the easiest test case to give you. @ShanSunNOAA, what system are you making your runs on? I could also globus one of my run directories to Hercules.

@ShanSunNOAA
Collaborator

Thanks @NickSzapiro-NOAA for a "quick fix"! I saved the crashed ice output under /scratch2/BMC/gsd-fv3-dev/sun/hr4_1013/COMROOT/c192mx025/gefs.20051101/00/mem000/model/ice/history_bad/.

I am testing your fix on Hera right now. Will let you know how it turns out later today.

Thanks!

@benjamin-cash
Author

Thanks @ShanSunNOAA ! It takes a while for jobs to get through the queue on Frontera so I wouldn't be able to test nearly so quickly. @NickSzapiro-NOAA , if this does turn out to fix the problem, would you expect it to change answers more generally? I.e., would the successful runs need to be redone for consistency?

@NickSzapiro-NOAA
Collaborator

I think this is the smallest change that would help, and it is localized to the range -puny < {da,dv}ice < -puny*{a,v}icen, where puny = 1.0e-11_dbl_kind. So the (hand-wavy) expectation is roundoff-level differences with the 32-bit atmosphere.

A bigger change would be to more actively remove very sparse ice

@ShanSunNOAA
Collaborator

ShanSunNOAA commented Jan 22, 2025

@NickSzapiro-NOAA The model crashed at the same time step as earlier, with a different error message:

1241:
1241: (shift_ice)shift_ice: dvice > vicen
1241: (shift_ice)boundary, donor cat: 3 4
1241: (shift_ice)dvice = 0.000000000000000E+000
1241: (shift_ice)vicen = -1.410289012658147E-065
1241: (icepack_warnings_setabort) T :file icepack_itd.F90 :line 594
1241: (shift_ice) shift_ice: dvice > vicen
1241: (icepack_warnings_aborted) ... (shift_ice)
1241: (icepack_warnings_aborted) ... (linear_itd)
1241: (icepack_warnings_aborted) ... (icepack_step_therm2)

Now it is vicen, not dvice, that has a value of about -1e-65. Should the same treatment be applied to vicen?
Thanks!

@NickSzapiro-NOAA
Collaborator

Yes, sorry about that @ShanSunNOAA. Would you mind re-testing with a change to all 4 conditions:

diff --git a/columnphysics/icepack_itd.F90 b/columnphysics/icepack_itd.F90
index 013373a..5d81bc3 100644
--- a/columnphysics/icepack_itd.F90
+++ b/columnphysics/icepack_itd.F90
@@ -462,7 +462,7 @@ subroutine shift_ice (trcr_depend,           &
                nd = donor(n)

                if (daice(n) < c0) then
-                  if (daice(n) > -puny*aicen(nd)) then
+                  if (daice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0
                   else
@@ -471,7 +471,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (dvice(n) < c0) then
-                  if (dvice(n) > -puny*vicen(nd)) then
+                  if (dvice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0
                   else
@@ -480,7 +480,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (daice(n) > aicen(nd)*(c1-puny)) then
-                  if (daice(n) < aicen(nd)*(c1+puny)) then
+                  if (daice(n) < aicen(nd)+puny) then
                      daice(n) = aicen(nd)
                      dvice(n) = vicen(nd)
                   else
@@ -489,7 +489,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (dvice(n) > vicen(nd)*(c1-puny)) then
-                  if (dvice(n) < vicen(nd)*(c1+puny)) then
+                  if (dvice(n) < vicen(nd)+puny) then
                      daice(n) = aicen(nd)
                      dvice(n) = vicen(nd)
                   else

@ShanSunNOAA
Collaborator

Thank you, @NickSzapiro-NOAA, for your prompt response! I am testing it now.

@benjamin-cash
Author

@NickSzapiro-NOAA - assuming this does fix the problem, how quickly do you think this can get incorporated as a PR? I can't help but notice that CICE-Consortium/CICE#645 has been open since 2021 and has had no activity since May of last year.

@benjamin-cash
Author

@ShanSunNOAA - what is the earliest crash you see? Most of my runs are crashing at around the 50-day mark (roughly 7 wallclock hours in), which makes testing awkward.

@ShanSunNOAA
Collaborator

@NickSzapiro-NOAA Good news - with your quick fix, my run successfully completed 3 months. Thank you again for your prompt help late last night - it made it possible to run it overnight, as it takes 4-5 hours to reach the crashing point.
@benjamin-cash My crashed runs typically occur around days 50–60 as well. In this particular case, it originally crashed on day 54.

@benjamin-cash
Author

@ShanSunNOAA thanks for the quick reply! Do you have a fork of Icepack you can include the fix in, so we can be sure we are working off the same code?

@ShanSunNOAA
Collaborator

@benjamin-cash I don't have a fork for this. @NickSzapiro-NOAA Are you going to submit a PR?

@NickSzapiro-NOAA
Collaborator

Good to hear.

In UFS, we use an EMC fork of CICE that uses the CICE-Consortium/Icepack submodule directly. Let me follow up at CICE-Consortium.

@benjamin-cash
Author

@ShanSunNOAA - could you confirm which hash of Icepack you are using in your runs? I've created a fork and a branch with the proposed fix, but I want to be sure that I haven't gotten my submodules out of sync.

@DeniseWorthen
Collaborator

Just a word of caution: while one previously failing case may now pass, that doesn't preclude a previously passing case now failing. So unless you redo the entire set of runs, you don't really know whether this is a fix.

@benjamin-cash
Author

@DeniseWorthen that's definitely a concern. Plus it is a change from the C96mx100 baseline configuration. On the other hand, it is a show-stopper bug for the C192mx025 runs, so we definitely need to do something. My thinking at this point is to first rerun one or two of my successful cases and compare the outcomes.

@DeniseWorthen
Collaborator

What would also be interesting to know is whether there is any seasonal signal in when the failing runs occur. Are they at a time of fast melt, hard freezing, etc.? Could you document the failed run dates and the day on which they fail?

@bingfu-NOAA

@DeniseWorthen I think I need to chime in. We have many crashed cases like that in the C384 runs, mainly in May/June.

@benjamin-cash
Author

@DeniseWorthen @bingfu-NOAA - I haven't looked exhaustively, but they seem to be in the 2-3 month range (May 01 start). And it is definitely variable. For example, C192mx025_1995050100 mem003 crashed at hour 1627, while mem010 crashed at hour 1960.

@DeniseWorthen
Collaborator

So, just for completeness in helping keep track of this issue, more information here:
https://bb.cgd.ucar.edu/cesm/threads/model-abort-due-to-dvice-negative-moved-from-cice-issues.8940/#post-52031

@benjamin-cash
Author

@dabail10, tagging you in for awareness.

@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 31, 2025

I've made a series of runs using Ben's sandbox. In each case, the tested run directory was copied from my base.rundir, which is a copy of Ben's sandbox w/ the links to the fix files replaced and some other minor mods (no post etc). All used the HR4 tag, compiled as

./compile.sh hercules '-DAPP=S2S -D32BIT=ON -DHYDRO=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1' s2s.hr4 intel NO NO 2>&1 | tee s2s.hr4.log

The only code mods I made were to comment out a bunch of print statements coming from ATM because I was sick of scrolling through them.

I have now completed the following runs:

| Run | ICE IC Description | Last FH; Date | FAIL | zap_snow |
| --- | --- | --- | --- | --- |
| base.nofix | unmodified ice IC | 1878; 2004-07-18 | Wall clock | Present |
| base.fix1 | removed ice-on-land | 1668; 2004-07-05 | Wall clock | none |
| base.fix2 | removed ice-on-land + removed phantom ice | 1896; 2004-07-19 | Wall clock | none |
| base.fix3 | removed phantom ice | 1896; 2004-07-19 | Wall clock | Present |

As mentioned previously, the one case where I did obtain a negative dvice error was when I had compiled w/o the HYDRO and 32bit settings. I am now repeating that test case to verify that result (even though it is wrong).

@benjamin-cash At this point, my opinion is that both ice-on-land and phantom ice are pathologies, regardless of whether they cause your negative dvice issue. For example, fixing the ice-on-land removes the "zap_snow" messages you previously saw. With the ice-on-land fix, I can compile and run your case in debug mode > 12 hours. Whatever is causing Shan's NST-related debug failure is not present in your case. The settings for Shan's case are significantly different from yours, so I would not get hung up on the NST-failure.

I would like you to try to repeat your failed case on your platform, with the two mods to the ice IC and report back if you once again get a negative dvice error.

@benjamin-cash
Author

Hi @DeniseWorthen - this is great, thanks! I will get those onto Frontera today and report back once they run.

@benjamin-cash
Author

@DeniseWorthen - I've shipped my new ice file to /work/noaa/nems/cash/ice_fail/, could you look at it and confirm I've applied your filters correctly before I launch?

@DeniseWorthen
Collaborator

The ncap2 commands look correct, but this doesn't seem to be the same IC file as in your original sandbox. It appears to be one of the files that Neil processed (it has NaN attributes and an additional field, "aicen_orig"). Your original file lacked both of those characteristics.

Also, the only variables which should have changed are aicen, vicen, Tsfcn, and vsnon. I'm seeing changes between your 20040501.000000.cice_model.res.nc file and the one from the fcst.37280 in multiple other fields (sice001, for example).
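
One quick way to check that offline (assuming the nccmp utility is available on the system; the second filename below is a placeholder for the filtered restart):

# list every variable whose data differs between the original restart and the
# filtered one; ideally only aicen, vicen, Tsfcn and vsnon should be reported
nccmp -d 20040501.000000.cice_model.res.nc cice_model.res.filtered.nc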

@benjamin-cash
Author

Ah, yes, I did also apply his code. I'll rerun with just your filters (will have to wait a bit until I get back from an appointment).

@benjamin-cash
Author

Denise - could you place your CICE IC file on Hercules or send me the path so I can have it for reference, rather than me pinging you each time I iterate on generating my own version?

@DeniseWorthen
Collaborator

The version I have w/ both the ice-on-land and the phantom ice fixes is

/work2/noaa/stmp/dworthen/stmp/dworthen/negdvice/cice_model.res.fix2.nc

@DeniseWorthen
Collaborator

As mentioned previously, the one case where I did obtain a negative dvice error was when I had compiled w/o the HYDRO and 32bit settings. I am now repeating that test case to verify that result (even though it is wrong).

I repeated my previous test case and confirmed that the only case in which I produced a negative dvice error was when I mistakenly left off the -D32BIT=ON -DHYDRO=ON settings during compile. In that failed run, I had fixed the ice-on-land error but had not yet fixed the phantom-ice error.

@benjamin-cash
Author

Update - Frontera is (apparently) having some kind of disk issue that has tanked the performance of my runs, so my test case keeps timing out. However, one thing I can report is that the zap_snow warning disappears with the new ICs. @LarissaReames-NOAA

@benjamin-cash
Author

Success! Finally got my test case through whatever is going on with Frontera and it ran to completion. No dvice aborts and no zap_snow warnings.

I will go ahead and close this, although I will end with a plea to @dabail10 and @NickSzapiro-NOAA for some kind of check in the ice model to identify these kinds of issues at startup. Having the run proceed for 1900+ hours before crashing due to an IC problem is not very intuitive. :)

@DeniseWorthen
Collaborator

@benjamin-cash I'm not at all sure why you're implying something needs to be fixed in CICE. There were no CICE changes required to "fix" the issue, just the use of rational ICs.

@benjamin-cash
Author

Hi @DeniseWorthen - Sorry that wasn't clear. What I would like to see is some kind of check in the ice model, or really anywhere in the code, for the pathologies in the ICs that you identified.

@DeniseWorthen
Collaborator

But really, those pathologies were created by whatever process produced the replay ICs. As far as I understand it, those are not just restarts straight from CICE. There was DA involved, etc.?

@NickSzapiro-NOAA
Collaborator

Thanks for your efforts to work through this and glad it was resolved. Do any of the other components have QC checks after reading native restarts?

@dabail10

dabail10 commented Feb 4, 2025

I agree with @DeniseWorthen. The QC checks on the initial files should be done as a pre-processing step. I have had to do this with the qice and qsno fields to make sure they are consistent with the volume.
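
A hedged sketch of the zero-volume part of such a consistency check, in the same ncap2 style used elsewhere in this thread (the per-layer enthalpy names qice001/qsno001 are assumed from the CICE restart convention, one line per layer would be needed, and a fuller check would also verify the sign and magnitude of the enthalpies where volume is nonzero):

src=cice_model.res.nc
out=cice_model.res.qc.nc
cp ${src} ${out}
# zero the first-layer ice/snow enthalpies wherever the corresponding volume is zero
ncap2 -O -s 'where(vicen == 0.0) qice001=0.0' ${out} ${out}
ncap2 -O -s 'where(vsnon == 0.0) qsno001=0.0' ${out} ${out}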

@guillaumevernieres

But really, those pathologies were created by whatever process produced the replay ICs. As far as I understand it, those are not just restarts straight from CICE. There was DA involved etc?

@DeniseWorthen is correct; this was a bug in the pre-processing of the ICs. The replay was inserting the ORAS5 (or OSTIA?) analysis into the CICE restarts. I'll create an issue in the relevant JCSDA/JEDI/SOCA repo to make sure this has been addressed.

@DeniseWorthen
Collaborator

Adding a note for future reference.

I had one case where I was able to produce a negative dvice error: leaving off HYDRO/32BIT but fixing the ice-on-land in the IC.

I repeated that case with the removal of phantom ice and it resolved the issue; I no longer got negative dvice in the HYDRO/32BIT-off case.

@DeniseWorthen
Collaborator

Another note for future reference.

In working on ufs-community/UFS_UTILS#1019, I've found that there can also be nonzero vsnon values where sum(aicen)==0. These were not removed by the original ncap2 commands. The command would be

ncap2 -s 'where(aicen.total($ncat) == 0) vsnon=0' in.nc out.nc

@DeniseWorthen
Collaborator

DeniseWorthen commented Feb 7, 2025

@LarissaReames-NOAA The following script can be used to fix existing 1/4 deg ICE restart files using ncap2.

This script will result in a QC'd source file identical to that created by the downscaling utility (the QC'd fields are then down-scaled).

#!/bin/bash

set -x

maskfile=/scratch1/NCEPDEV/global/glopara/fix/cice/20240416/025/kmtu_cice_NEMS_mx025.nc
src=cice_model.res.nc
dst=ncap2.fixes.nc

cp ${src} ${dst}
ncks -A ${maskfile} ${dst}
ncap2 -O -s 'where(kmt==0) aicen=0.0' ${dst} ${dst}
ncap2 -O -s 'where(kmt==0) vicen=0.0' ${dst} ${dst}
ncap2 -O -s 'where(kmt==0) vsnon=0.0' ${dst} ${dst}
ncap2 -O -s 'where(kmt==0) Tsfcn=0.0' ${dst} ${dst}
ncap2 -O -s 'where(aicen.total($ncat) == 0.0) vicen=0.0' ${dst} ${dst}
ncap2 -O -s 'where(aicen.total($ncat) == 0.0) vsnon=0.0' ${dst} ${dst}
ncks -O -x -v kmt ${dst} ${dst}

@LarissaReames-NOAA
Collaborator

Thanks @DeniseWorthen . We'll make sure the relevant users are aware of this.

@benjamin-cash
Author

I am still seeing a relatively large number of dvice crashes, but I had missed the update to the set of ncap2 commands. The crashes also seem to be concentrated in a few sets of ICs, rather than scattered across the runs. I'm going to apply the new vsnon filter and rerun the failed cases - I'll update here on the result.

@DeniseWorthen
Collaborator

Do these runs include Neil's processing of the IC?

@benjamin-cash
Author

benjamin-cash commented Feb 10, 2025

No, I just downloaded the files from AWS and applied the nco filters, with no additional processing via Neil's script. But at the time I created them you either hadn't yet posted ncap2 -s 'where(aicen.total($ncat) == 0) vsnon=0' in.nc out.nc or I missed seeing it, so I didn't apply that filter.
