
[QUESTION] Multi-run simulations #198

Closed
lorenzocostantino opened this issue Feb 10, 2022 · 6 comments
Labels: category: Question

@lorenzocostantino

I use GCHP 13.3.4 (at C180) and I want to launch multi-run simulations (ideally 365 daily runs).

I saw that you have already shown how to perform a multi-run simulation (e.g., #136 and others), but sometimes answers change as model versions evolve. I also checked the c360_requeuing.sh script. Still, it is not completely clear to me how to launch multiple chained runs with GCHP 13.3.4, as I would do something somewhat different from c360_requeuing.sh.

Let's say I have

Start_Time_Date="20190101"
End_Time_Date="20190110"
Duration_Date="00000001"  # 1 day

I see that this model version automatically updates the cap_restart file at the end of each segment.
If Periodic_Checkpoint=OFF, at the end of each segment only the "gcchem_internal_checkpoint" file is written, overwriting the "gcchem_internal_checkpoint" file of the previous run.
If I am not wrong, we can use this file as GCHPchem_INTERNAL_RESTART_FILE for segment 2 and onward, updating GCHP.rc (and then re-launching gchp.sh) at line 70:

# Chemistry/AEROSOL Model Restart Files
# Enter +none for GCHPchem_INTERNAL_RESTART_FILE to not use an initial restart file
# -------------------------------------
GCHPchem_INTERNAL_RESTART_FILE:     +initial_GEOSChem_rst.c24_fullchem.nc
GCHPchem_INTERNAL_RESTART_TYPE:     pnc4
GCHPchem_INTERNAL_CHECKPOINT_FILE:  gcchem_internal_checkpoint
GCHPchem_INTERNAL_CHECKPOINT_TYPE:  pnc4

with

GCHPchem_INTERNAL_RESTART_FILE: gcchem_internal_checkpoint

PS: should I use the "+" before the file name?
It is not completely clear to me what "+" does, apart from the following message in the standard output:
WARNING: use of '+' or '-' in the restart name 'initial_GEOSChem_rst.c24_fullchem.nc' allows bootstrapping!

Within a script, I would do something like:

di="${Start_Time_Date}"    
de="${End_Time_Date}"
nd=${Duration_Date}  

# Daily loop
rm -rf CONTINUE_SIM
while [ $di -le $de ] ; do
    if [ ! -f CONTINUE_SIM ] ; then  
        rm -f cap_restart gcchem*    # To avoid restarting from a previous simulation
        ./runConfig.sh
        touch CONTINUE_SIM
    else
        sed -i "s/initial_GEOSChem_rst.c${CS_RES}_fullchem.nc/gcchem_internal_checkpoint/g" GCHP.rc # To re-start from the previous simulation
    fi
    mpirun -np $PAR_TOTAL_CORES --use-hwthread-cpus ./gchp &> out.${di}-segment.log || exit 1
    di=`date -u -d "$di $nd days" +%Y%m%d`
done

Is that correct?


For coherence with other model outputs, I would also like to output daily files with hourly statistics (one file per day, with 24 time steps).
To do that, is it correct to set

timeAvg_freq="010000"
par_timeAvg_dur="250000"

inst_freq="010000"
par_inst_dur="250000"

?

Thank you in advance for your help.

@lorenzocostantino added the category: Question label on Feb 10, 2022

LiamBindle commented Feb 11, 2022

Hi @lorenzocostantino, thanks for this question too.

The primary objective of multi-segment runs is to break the simulation up into a series of consecutive jobs for your scheduler. Your script is looping over the days, so presumably this would be a single job that is submitted to your scheduler. In this case, there isn't a reason to run the simulation as a multi-segmented run—you would be better off turning periodic checkpointing on and setting the duration equal to the total length of your simulation (i.e., a single segment).
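
For reference, a minimal sketch of that single-segment alternative in runConfig.sh (values are illustrative; variable names are as they appear elsewhere in this thread, and the one-year duration assumes the YYYYMMDD duration format used in the examples below):

Start_Time_Date="20190101"
End_Time_Date="20200101"
Duration_Date="00010000"     # a single segment spanning the full year (assumed YYYYMMDD duration format)
Periodic_Checkpoint=ON
Checkpoint_Freq="240000"     # write a timestamped checkpoint every 24 hours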

Assuming you do actually want to run your simulation as a series of multi-segment runs, here are a few notes for your consideration:

  1. Generally, I aim for segments that take ~24 hours of real time. For a C180 simulation I would choose 20-30 days for the segment length. This will also cut down on the storage needed for your restart files (365 restart files at C180 would be ~7 TB). Longer segments also mean that you will incur the simulation's initialization time less often (the first day of a segment is slower than the rest because the simulation needs to go through initialization).
  2. Generally, it's helpful if your run script is reentrant. That is, when it starts it should automatically resume from the right place, according to the current state of your run directory.

With these two points in mind, you might consider a configuration like this:

runConfig.sh:

Start_Time_Date="20181217"   # See GCHP#197
End_Time_Date="20200101"
Duration_Date="00000020"     # 20 day segments

and then a run script of

#SCHEDULER_DIRECTIVES
#...
#SCHEDULER_DIRECTIVES

function last_checkpoint() {
    ls -1 gcchem_internal_checkpoint*.nc4 | tail -n 1
}
function last_checkpoint_date() {
    last_checkpoint | sed 's/gcchem_internal_checkpoint.\([12][0-9][0-9][0-9][0-1][0-9][0123][0-9]\).*/\1/'
}

# Configure starting/resuming the simulation
if ! ls -1 gcchem_internal_checkpoint*.nc4 &> /dev/null ; then
    # no checkpoint file exists, therefore, initialize the start of a simulation
    ./runConfig.sh 
    RESTART_DATE=${Start_Time_Date}
else
    # a timestamped checkpoint file exists, therefore, resume from the most recent one
    RESTART_FILE=$(last_checkpoint)
    RESTART_DATE=$(last_checkpoint_date)
    echo "$RESTART_DATE 000000" > cap_restart
    sed -i "s/GCHPchem_INTERNAL_RESTART_FILE: .*/GCHPchem_INTERNAL_RESTART_FILE: $RESTART_FILE/g" GCHP.rc
fi
mpirun -np $PAR_TOTAL_CORES --use-hwthread-cpus ./gchp &>> out.${RESTART_DATE}-segment.log

This script will resume your simulation from the most recent gcchem_internal_checkpoint.*.nc4 file in your run directory. Regardless of whether the previous job failed or succeeded, this script should resume your simulation from the last segment that completed successfully.

The idea is that you would submit this job to your scheduler multiple times, and use job dependencies to get them to run one after the other (with LSF this is the -w option, and with SLURM this is the --dependency option). For example, to run your simulation for 200 days with 20-day segments, you would need to submit this job 10 times.
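
For illustration, a minimal sketch of chaining the segments with SLURM (gchp.job is a hypothetical script name; afterany lets the chain continue whether or not the previous segment's job exited cleanly, which fits the reentrant design above):

# Submit the first segment, then queue 9 more behind it (10 segments total)
jobid=$(sbatch --parsable gchp.job)
for i in $(seq 2 10); do
    jobid=$(sbatch --parsable --dependency=afterany:${jobid} gchp.job)
done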


Here are my responses to the other questions you had

If Periodic_Checkpoint=OFF, at the end of each segment only the "gcchem_internal_checkpoint" file is written, overwriting the "gcchem_internal_checkpoint" file of the previous run.
If I am not wrong, we can use this file as GCHPchem_INTERNAL_RESTART_FILE for segment 2 and onward, updating GCHP.rc (and then re-launch gchp.sh) at line 70

I thought that overwriting gcchem_internal_checkpoint wasn't allowed, but I could be wrong. I thought a fatal error was thrown if gcchem_internal_checkpoint already exists when GCHP goes to write it. This is why I've always opted for the timestamped restart files. Please let me know if you find otherwise.

PS: should I use the "+" before the file name?

No, you can omit it. IIRC, the + means missing variables are initialized to zero. Your checkpoint files should have all the species, so you don't need a +.


Here are some extra things for your consideration.

For simulations like this, it's often easiest to test your simulation configuration at a low resolution like c24 or c48. Once your simulation is working well, then increase the resolution and resources.

If you haven't already, it would be a good idea to consider writing a custom collection for HISTORY.rc. A custom collection (as opposed to the default ones) can save you a ton of storage and it also makes analysis down the road a lot easier. The documentation for this is here. For reference, here is a list of fields we commonly use in our group (gaseous and PM AQ species).
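
As a rough illustration only (the collection name and field list below are made up; copy the layout of an existing collection in your HISTORY.rc rather than this sketch), a custom collection for daily files of hourly averages might look something like:

COLLECTIONS: 'MyAQ',
::
  MyAQ.template:   '%y4%m2%d2_%h2%n2z.nc4',
  MyAQ.format:     'CFIO',
  MyAQ.frequency:  010000
  MyAQ.duration:   240000
  MyAQ.mode:       'time-averaged'
  MyAQ.fields:     'SpeciesConc_O3      ', 'GCHPchem',
                   'SpeciesConc_NO2     ', 'GCHPchem',
                   'PM25                ', 'GCHPchem',
::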


Hope this is helpful. Let me know if you have any questions or if I've misunderstood anything.


lorenzocostantino commented Feb 24, 2022

Hi @LiamBindle, thanks for your explanations and insights. Very clear and helpful.
A few follow-up points to dig a little deeper into the issues raised:

  1. After some double-checking, using Periodic_Checkpoint=OFF, I can confirm that for segment 2 and onward you can put
GCHPchem_INTERNAL_RESTART_FILE: gcchem_internal_checkpoint
...
GCHPchem_INTERNAL_CHECKPOINT_FILE: gcchem_internal_checkpoint

and then GCHP overwrites the previous gcchem_internal_checkpoint without any error (and cap_restart is also updated automatically to the following segment).

  2. If I understand correctly, if I set Periodic_Checkpoint=ON for multi-run simulations, I would also need to change the runConfig script as
Checkpoint_Ref_Date: from START to the end date of my segment
Checkpoint_Ref_Time: from START to the end hour of my segment

otherwise, GC writes it at the beginning of the simulation and the checkpoint file will not include all following simulated hours.

  3. If I am right, I would also stress a small difference in the output frequency definitions in runConfig, where
    par_timeAvg_dur="250000" writes a GCHP data collection output every 24 hours
    Checkpoint_Freq="240000" writes a Periodic_Checkpoint output every 24 hours
    Is this correct?

  4. The possibility you mentioned of modifying HISTORY.rc to re-shape GCHP output as needed is extremely useful, indeed.
    Can you confirm that in this GCHP version (13.3.4) there is no PM10 diagnostic variable (while PM2.5 is already present)?
    As described in: http://wiki.seas.harvard.edu/geos-chem/index.php/Particulate_matter_in_GEOS-Chem#PM2.5_and_PM10_diagnostics_for_GEOS-Chem
    As a first approximation, does it make sense to use the definition reported in this wiki and quantify PM10 as

PM10 = PM2.5 
     + ( DST2 * 0.7  )
     + DST3
     + ( DST4 * 0.9  )
     + ( SALC * 1.86 )   # NOTE: The value of 1.86 is the SSA_GROWTH factor at 35% RH
  5. If I create a new collection in HISTORY.rc, as I did when adding your ACAGGaseous collection, should I also add it to the runConfig script? In the following lines:
# Time-averaged HISTORY.rc collections to auto-update
timeAvg_collections=(ACAGGaseous \   #  <---- my new collection
                     SpeciesConc    \
                     AerosolMass    \
                     Aerosols       \
                     Budget         \
                     CloudConvFlux  \
                     ConcAfterChem  \
                   etc.
                   etc.
)

(I didn't add my new collection into the runConfig script and everything seems to work fine, ... but I would like to be sure...)


LiamBindle commented Mar 21, 2022

Sorry @lorenzocostantino, this fell through the cracks!

  1. Thanks for confirming. I need to review how gcchem_internal_checkpoint works then.
  2. In runConfig.sh, the defaults should be Checkpoint_Ref_Date=START and Checkpoint_Ref_Time=START. This should be left as is. The reference date and time refer to the initial date from which the checkpointing frequency is interpreted. For example, if your checkpoint frequency were 48 hours and your reference date was 2019-01-05, then you would get checkpoint files on the 7th, the 9th, and so on.
  3. I believe it should always be timeAvg_dur=240000 if you want your time-averaged collections every 24 hours. I find the handling of diagnostic frequency/period by runConfig.sh confusing, so personally, I set AutoUpdate_Diagnostics=OFF and manually edit HISTORY.rc. The documentation is here, and I think it's easier to understand if you edit that file directly (see the settings sketch at the end of this comment).
  4. I don't know off the top of my head. You could check if it's available by adding an entry to your collection like 'PM25 ', 'GCHPchem',. Your simulation will crash if it isn't available.
  5. That's correct.

Edit: Regarding 4, @yantosca says that the PM10 diagnostic will be available in 13.4.
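
To make points 2 and 3 concrete, here is a hedged sketch of the corresponding runConfig.sh settings (values are illustrative; variable names are as used in this thread):

# Periodic checkpointing: leave the reference date/time at their defaults
Periodic_Checkpoint=ON
Checkpoint_Ref_Date=START     # the checkpoint frequency is counted from this reference
Checkpoint_Ref_Time=START
Checkpoint_Freq="240000"      # one timestamped checkpoint every 24 hours

# Daily files of hourly time-averaged output
timeAvg_freq="010000"         # 1-hour averaging windows
timeAvg_dur="240000"          # start a new file every 24 hours (see the note on #162 below)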

@LiamBindle

Oh, I just noticed that #162 wasn't fixed in 13.3. This was a bug that caused the timestamp in the filename of diagnostic files to be incorrect (the timestamp in the filename was the wrong day, but the time coordinate in the file was actually okay). I just realized that point 3 might have been an attempted workaround for this issue. If that's the case, you could try cherry-picking the fix in the MAPL repo by doing cd GCHP/src/MAPL; git checkout gchp/dev. Let me know if you would like me to elaborate.

@lorenzocostantino

Hi @LiamBindle,
Thank you so much again.
Yes, you were definitely right. As suggested, I did:
- cd GCHP/src/MAPL
- git checkout gchp/dev
- recompilation
and this solved the time-averaging issue: timeAvg_dur=240000 now leads to time-averaged collections every 24 hours.
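
For completeness, the sequence above as a sketch (paths are illustrative and assume a standard GCHP CMake build directory configured with the run directory as the install target):

cd /path/to/GCHP/src/MAPL
git checkout gchp/dev      # pick up the MAPL fix for #162
cd /path/to/your/build/dir
make -j                    # rebuild
make install               # reinstall the gchp executable into the run directory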
