Updating ML Warm Rain Emulators #50
From Hugh Morrison:
From Gagne: It looks like nragg is set to 0 if qric < qsmall (1e-18); see https://github.com/ESCOMP/PUMAS/blame/fbcb46a91b5b3f1e84c898a8b00f91e69b0fe4db/micro_pumas_utils.F90#L2166. Should that be changed? I am trying to run the code now without the nr ratio check and will see if things blow up in interesting ways… The version I have been experimenting with is in /glade/work/dgagne/CAM with modifications to PUMAS in
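For readers skimming the thread, a minimal sketch of the guard being discussed, in Python pseudocode rather than the actual PUMAS Fortran; only the threshold logic mirrors the check above, while the rate expression, k_coll, and the function name are illustrative assumptions:

```python
QSMALL = 1.0e-18  # rain mixing-ratio threshold cited above (kg/kg)

def rain_self_collection_tendency(qric, nric, k_coll=1.0):
    """Illustrative nragg calculation: zeroed when rain mass is below qsmall."""
    if qric < QSMALL:
        return 0.0  # negligible rain mass, so nragg is left at 0
    return -k_coll * nric * qric  # placeholder sink on rain number
```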
Greetings. I'll confirm what David John has said thus far: I took a look at the current code, and the original unported code that ran.
I'm surprised this even worked! I will start with rebuilding the TAU code to correct these issues (the ratios, for one; I need a strategy for that). Then we can retrain and work on the emulator code. This may take me a few days since I am pretty busy.
@andrewgettelman - Feel free to either open a PR back to the https://github.com/ESCOMP/PUMAS/tree/pumas_ml_tau_ported branch or just commit/push your changes directly back to this branch if you feel comfortable doing either of these. I am not currently doing any cleanup, so we won't trample on each other. Alternatively, point me to your source mods and I can bring them into this branch once you are done.
Question for the group, especially @cacraigucar and @Katetc: I think we need to re-design the tendencies here to be consistent. I am thinking that:
Thoughts?
Not sure this is relevant, but reported by Brian Dobbins: I did just try the same code that fails by default with the Intel 2022.1 compiler (switched from Intel 2019, the default), and it ran without a crash. Still in DEBUG mode, I literally just swapped the 'intel/19.1.1' entry in env_mach_specific.xml with 'intel/2022.1', and ran, by hand. Prior to this it crashed on step 881; now it completes, though it still gives warnings about significant water conservation errors. Runs here: Failed: /glade/scratch/bdobbins/scam_ml_crash.arm97/run/atm.log.9486121.chadmin1.ib0.cheyenne.ucar.edu.230418-144729 I ran the latter by hand, interactively, hence the lack of a full name on the job. So perhaps there are scientific and technical issues.
Okay, made an initial assessment. I'm going to start with Goal #1: get 'tau' working. Steps:
Comments welcome on this plan.
@andrewgettelman I think that plan sounds reasonable. Getting TAU working with the right output variables and ratio checks makes a lot of sense as a first step. I imagine we will need new training data for the emulator after that. With a much better idea of the constraints going into the model, the retrained emulator should be both simpler and more performant.
As a way of joining the convo, a few naive questions:
Hi @mvanlierwalq...
Yes, should be.
I think so: the original 'unported' code (sorry, 'ported' means ported to the newer version, so 'unported' is the old code from the ML paper, Gettelman et al. 2021) had the TAU bin operation occurring on in-cloud values. The emulator was operating on grid means, so we have to figure out if something was lost in translation. Always a problem when you are jumping back into it. There is probably a logic here I need to understand.
Not sure: @djgagne will have to answer this.
Adding myself to thread.
Cloud fraction and precipitation fraction were both inputs to the emulator in the original paper and in the updated emulator being tested. The emulator was blowing up pretty consistently before they were added, so the neural network did figure out how to use them approximately. I do think the emulator will have a much easier time learning the relationship once the in-cloud values are used for training.
Jumping in as well.
replying to @cacraigucar
Yes
From @djgagne, replying to @cacraigucar, regarding the extra functions in the neural network module that aren't currently being used:
Just a quick update:
@andrewgettelman - Can you point me to your current code mods that I should use? I would like to commit to both the PUMAS branch as well as the CAM branch, so we are both working with the same code base.
Okay, that seems to complete and give me sensible answers in 3D. Not exactly the same (expected), but at least it looks like Earth (sensible climate). @cacraigucar - code is here: /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported2 @djgagne - I think I can now try to generate a run for training data. I will obviously output tendencies for qctend_TAU, nctend_TAU and nrtend_TAU. What other variables should I provide as instantaneous output? Do you recall what you used in the last iteration and how frequently? Also, how much data went into it? If you refresh my memory I can generate new training data. Maybe it's best to check this code in so I can run from an actual tag/branch, as Cheryl suggests....
Summary of the major issues fixed in the code:
I am glad to hear that the code changes appear to have fixed the problems! For the last training simulation, I output instantaneous information every 23 hours for 2 years so that it would step through the day. The files are located at I am currently using the following input variables:
Will you save out the grid cell or in-cloud versions of QC, QR, NC, and NR? I would like to train on the in-cloud versions but could also derive them from the total grid cell versions (or derive total grid cell from in-cloud) as long as I have lcldm, precip_frac, and RHO_CLUBB. I also have the following diagnostics in the output file: lamc, lamr, n0r, pgam. I don't think they are needed for the emulator at this point, so you can drop them if you want, but they could be useful. For outputs, I would want the following variables:
With the quantile transform method, I might be able to predict the tendencies directly, which would simplify things further. If I don't have to worry about directly conserving mass, then I can make a single neural net based on the tendencies and get the output "close enough" for the conservation checks to clip them.
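For context, a minimal sketch of the quantile-transform approach described in this comment, assuming scikit-learn as the toolkit; the feature count, network size, and quantile settings are illustrative assumptions, not the actual training configuration:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 7))  # stand-in for in-cloud inputs (e.g. qc, qr, nc, nr, rho, lcldm, precip_frac)
y = rng.normal(size=(1000, 3))     # stand-in for the predicted tendencies

# Quantile-transform both inputs and targets so the network sees well-behaved
# distributions, then predict the tendencies directly; conservation is handled
# afterwards by the host model's clipping/ratio checks.
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("qt_in", QuantileTransformer(n_quantiles=100, output_distribution="normal")),
        ("net", MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)),
    ]),
    transformer=QuantileTransformer(n_quantiles=100, output_distribution="normal"),
)
model.fit(X, y)
example_tendencies = model.predict(X[:5])
```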
Thanks @djgagne - I will check the code and see about generating the output. I think I can make all of these variables in-cloud (input, output and tendencies) to be consistent. That's NOT quite how it is done now, so I am checking this. If it is simpler to do the tendencies directly and clean them up later, I am happy to try that....
Just an update: in double-checking everything I found a mistake in my ordering that was neglecting initial rain in the TAU code. This then uncovered a few other issues in my original implementation. I have it running in SCAM now, but it's crashing in the 3D code after 1 day or so. Still working on this: I think I need to look at the guardrails again and the original implementation.
Okay. I've validated that my current version is BFB with the original KK2000. I also put a conservation check in the code so the TAU collection code will not produce tendencies that result in negative number or mass. However, the tendency from autoconversion and accretion can now be positive or negative. @hmorrison100 - It seems like the 'ratio' conservation checks (e.g. for autoconversion) assume that all tendencies have only one sign, and that is now not the case with stochastic collection. E.g.:
If stochastic collection is the only thing operating and it increases nc (prc < 0), then dum < 0 and no conservation is done at this point as far as I can tell (no scaling of tendencies). I guess that is okay (since I have checked that there cannot be non-conservation from stochastic collection itself).... Thoughts? SCAM runs a standard case, but the 3D model seems to be dying on the first time step.
Last comment of the day: if I 'clip' the stochastic collection tendencies to only be one sign, then the code runs through a month. So I think we might need to think about how to adjust the conservation checks to accommodate tendencies that can be positive or negative for 'autoconversion'.
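To make the two options concrete, here is a rough sketch in Python (the actual PUMAS checks are Fortran and differ in detail; the function names and the way dt enters are assumptions for illustration only):

```python
def clip_to_sink_only(tend):
    """The 'clip' workaround above: keep only the depleting (negative) part of a mixed-sign tendency."""
    return min(tend, 0.0)

def scale_sinks_for_conservation(q, tends, dt):
    """A sign-aware ratio check: scale only the negative (sink) tendencies so that
    the update q + dt * sum(tends) cannot drive q negative; source terms are untouched."""
    total_sink = sum(t for t in tends if t < 0.0)
    if total_sink >= 0.0:
        return list(tends)                    # no sinks, nothing to limit
    ratio = min(1.0, -q / (dt * total_sink))  # largest fraction of the sinks that q can support
    return [t * ratio if t < 0.0 else t for t in tends]
```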
I have brought in Andrew's changes (as of today) and made commits to both the PUMAS branch and my CAM branch. To check them out:
git clone https://github.com/cacraigucar/CAM cam_ml_tau_ported
I will now proceed with implementing additional code cleanup and will give y'all the new hash when that code is committed.
Hi @cacraigucar - I did a lot of development yesterday. The current version is /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported4, which has a few more modifications and gets the code back to BFB with 'kk2000'. This version now runs TAU about 6 days in 3-D before dying. Something in the state is going wrong that the conservation checkers are not catching, and then eventually CLUBB fails.
I have a new hash for my latest code commit. This contains Andrew's mods_ml_tau_ported4 files from mid-day today as well as cleanup that I've made. The hash is 93eebd0. I'm using the ne30pg3_ng30pg3_mg17 grid and compset F2000dev for my tests, and I am setting DEBUG=TRUE. This code runs successfully for 9 time steps using "emulated" and the three files that I have from a while ago for the stochastic_emulated_filename... files. These are files in the /glade/work/dgagne/cam_mp_run6_quantile_nn directory. Unfortunately my "tau" run crashes with a floating point exception on the first time step. The partial traceback is:
The crash is at the same statement that has been giving me trouble. The ratio calculation for rain number (nr):
However, I have tried trapping for a failure here, and then I can get it to go farther, but it crashes in CLUBB (~200 time steps for me).
Is there a way to set nr to 1e-12 if the predicted tendency pushes it below a certain threshold? When debugging before, I noticed that sometimes the updated nr value would be 1e-12 and sometimes it would be 0. I also noticed that the nr ratio check would produce divide-by-zero errors because all the tendencies in the denominator ended up being 0. Is there an epsilon term we could add to the denominator to prevent those divide-by-zero errors?
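A small sketch of the two safeguards floated in this comment, again in Python for readability (the epsilon value, the 1e-12 floor, and the form of the denominator are assumptions for illustration, not the PUMAS expressions):

```python
EPS = 1.0e-30       # tiny term to keep the ratio denominator from being exactly zero
NR_FLOOR = 1.0e-12  # floor suggested above for the updated rain number

def nr_ratio(nr, total_nr_sink, dt):
    """Limiter ratio for rain-number sinks, guarded against a zero denominator."""
    return nr / (dt * abs(total_nr_sink) + EPS)

def apply_nr_floor(nr_updated):
    """Keep the updated nr from falling to 0 or below the chosen threshold."""
    return max(nr_updated, NR_FLOOR)
```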
@djgagne @andrewgettelman Yes, we might want to add an epsilon to prevent division by zero for the Nr check. However, there seems to be another bug; see *** in the comments below. Below are some comments after looking at the code. NOTE: line numbers below refer to micro_pumas_v1.F90 in the "pumas_ml_tau_ported" branch.
On an unrelated topic, I am bringing the KBARF file into PUMAS, as @andrewgettelman suggested in an email discussion a couple of months ago. At the very least, I would suggest we rename this file to KBARF.txt (since it is an ASCII file). I have two suggested modifications and one question:
Question: I can make the changes based on the answers I receive, but I do need answers to questions 1 and 2, as I don't know them myself.
Hi @cacraigucar - we went over this before. KBARF is the name used in the original code; we could call it KBARF_tau_kernel.txt. I'd probably prefer to leave the file unmodified so we don't have to change the code or the file. Is it small enough to put it in the code directory? Or does it need to go somewhere else?
@andrewgettelman Thanks for the new name for the file. I won't add anything to the file contents itself (though it wouldn't be hard to read a couple of lines of text before reading the actual contents). Yes, I am planning on putting it into the PUMAS github repository since it is so small. I thought it might be nice to "hide" the file by creating a /data directory in PUMAS and putting the file in there. Other small "data" files that PUMAS might expand to needing in the future could also go in this directory. Otherwise, I can just place it right beside all of the PUMAS code and not make a /data directory. Let me know your thoughts.
Just a quick update: after discussions with @hmorrison100 I managed to get the code working, and TAU now runs through a month, producing reasonable values. I'll do some more checking and cleanup and try to get it ready by tomorrow for checking in something that works, and thinking about trying a training run for @djgagne. Not sure if I will get that done before leaving for travel tomorrow. Thanks again @hmorrison100!
@andrewgettelman - please point us to your latest code before you leave, if you can. I can work on cleaning it up if needed (removing print/write statements, etc.)
@cacraigucar for the moment let's just put the file in the PUMAS directory. Where are you going to put a description if not in the file? In a README? The description can read: KBARF_tau_kernel.txt contains coefficients for the stochastic collection kernel used by the TAU stochastic collection code, invoked in PUMAS with micro_pumas_warm_rain = 'tau'. I've cleaned up my code a bit, verified it runs in SCAM, and am starting a 2-year test that should give training data for @djgagne. So even if this fails, it's probably as far as I am going to get this week, and you can tag it:
The code completed a year. It finished, but something happened with a restart file and it did not auto-restart. I'm off today for a week and don't have time to deal with that. But I have 1 year of training data, @djgagne, if you want to look at it or take a first cut at it: These are instantaneous fields, but we may have some issues with sub-stepping (I cannot remember how we shut off all the substeps). For now, if it is not too much effort, it might be good to do a quick update of the emulator just to see if I can get the code to run with it before we finalize and check everything in.
@andrewgettelman Are the input QC, QR, etc. fields in-cloud or for the whole grid cell? I can get the initial data processed through my scripts either today or Monday.
Thanks @djgagne - Everything (input, output and tendencies) is in-cloud now.
Here is a quick comparison of the new code climatology, compared to the 2021 paper version. Note the 'control' is very different now with PUMAS (control = KK2000). Thus this seems to be doing what we think it should be doing. @djgagne if you have any questions on the data used for the emulator, let me know. If you get a new emulator file, I can test it (or I can point you to the code to integrate it). Thanks! @cacraigucar, I'm not sure what we want to do with this now. I guess there is not a strong urge to move it into CAM until we get the emulator sorted out.
@andrewgettelman I met with Wayne earlier this week to go over the emulator training code and will work on getting an updated baseline emulator trained. I have had a large number of meetings and haven't had time to do the processing yet, but was at least able to take a quick look at and copy the CAM output files.
From David John Gagne:
Going through the PUMAS code, I’ve found a few more issues that may be affecting both TAU and the emulator. They appear to be fixable if we can get all of our ducks in a row.
On the machine learning side, I think it would make a lot of sense to re-train the emulator with in-cloud input values. Should I assume lcldm and precip_frac remain constant between the input and output qc, qr, nc, and nr when converting the calculations to in-cloud?
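For reference, the grid-mean to in-cloud conversion under discussion, as a short Python sketch; it assumes cloud quantities scale with lcldm and rain quantities with precip_frac, held fixed over the step, which is exactly the assumption being asked about:

```python
def to_in_cloud(qc, nc, qr, nr, lcldm, precip_frac, tiny=1.0e-12):
    """Convert grid-mean values to in-cloud (cloud water/number) and in-precip (rain) values."""
    cf = max(lcldm, tiny)        # avoid dividing by a zero cloud fraction
    pf = max(precip_frac, tiny)  # avoid dividing by a zero precipitation fraction
    return qc / cf, nc / cf, qr / pf, nr / pf

def to_grid_mean(qc_ic, nc_ic, qr_ic, nr_ic, lcldm, precip_frac):
    """Inverse mapping: scale in-cloud/in-precip values back to grid-cell means."""
    return qc_ic * lcldm, nc_ic * lcldm, qr_ic * precip_frac, nr_ic * precip_frac
```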