ne1024 F case run on Cori aborted from namelist_mod #3417

Closed
dqwu opened this issue Jan 28, 2020 · 10 comments

Comments

@dqwu
Contributor

dqwu commented Jan 28, 2020

This failure occurred less than 5 minutes after case.run started. The stack trace is:

e3sm.exe           0000000002EA09AC  shr_abort_mod_mp_         114  shr_abort_mod.F90
e3sm.exe           000000000206E4C5  namelist_mod_mp_r         878  namelist_mod.F90
e3sm.exe           0000000002034009  dyn_comp_mp_dyn_i         135  dyn_comp.F90
e3sm.exe           0000000001AEEB62  inital_mp_cam_ini          31  inital.F90
e3sm.exe           00000000004EFF22  cam_comp_mp_cam_i         159  cam_comp.F90
e3sm.exe           00000000004E3718  atm_comp_mct_mp_a         312  atm_comp_mct.F90
e3sm.exe           0000000000424EAF  component_mod_mp_         257  component_mod.F90
e3sm.exe           0000000000412B91  cime_comp_mod_mp_        1338  cime_comp_mod.F90
e3sm.exe           0000000000421B94  MAIN__                    122  cime_driver.F90

It has been confirmed that this issue is caused by PR #3368. In that PR's conversation, a similar stack trace was reported from the failed test SMS_D_Ln5.ne4_ne4.FC5AV1C-L.anvil_intel.cam-cosplite.

Below are detailed steps to reproduce it on Cori. The job wall-time can be set to 10 minutes to reduce wait time.

Note: in previous successful ne1024 F case runs on Cori, "se_phys_tscale = 0" was set in user_nl_cam. PR #3368 removed this obsolete namelist variable, so it is not used in the steps below.

git clone https://github.com/E3SM-Project/E3SM.git
cd E3SM
git submodule update --init

cd cime/scripts
./create_newcase --case FC5AV1C-H01A_ne1024np4_360x720cru_oRRS15to5 --compset FC5AV1C-H01A --res ne1024np4_360x720cru_oRRS15to5

cd FC5AV1C-H01A_ne1024np4_360x720cru_oRRS15to5
./xmlchange STOP_OPTION=nhours,STOP_N=1
./xmlchange NTASKS=16384
./xmlchange NTHRDS_ATM=16
./xmlchange MAX_MPITASKS_PER_NODE=8
./xmlchange MAX_TASKS_PER_NODE=128
./xmlchange CAM_CONFIG_OPTS="-phys cam5 -clubb_sgs -microphys mg2 -chem none -nlev 72"
./xmlchange EPS_AGRID="1.0e-10"
./xmlchange RUN_STARTDATE=2016-08-01
./xmlchange PIO_NETCDF_FORMAT=64bit_data
./xmlchange PIO_BUFFER_SIZE_LIMIT=134217728
./xmlchange ATM_NCPL=288
./xmlchange LND2ATM_FMAPTYPE="X"
./xmlchange LND2ATM_SMAPTYPE="X"
./xmlchange CAM_TARGET=theta-l

cat <<EOF >> user_nl_cam
    use_hetfrz_classnuc = .false.
    aerodep_flx_type = 'CYCLICAL'
    aerodep_flx_datapath = '/global/cfs/cdirs/acme/inputdata/atm/cam/chem/trop_mam/aero' 
    aerodep_flx_file = 'mam4_0.9x1.2_L72_2000clim_c170323.nc'
    aerodep_flx_cycle_yr = 01
    prescribed_aero_type = 'CYCLICAL'
    prescribed_aero_datapath='/global/cfs/cdirs/acme/inputdata/atm/cam/chem/trop_mam/aero'
    prescribed_aero_file='mam4_0.9x1.2_L72_2000clim_c170323.nc'
    prescribed_aero_cycle_yr = 01
EOF

cat <<EOF >> user_nl_cam
    !
    !  for SL transport:  keep rsplit=1, and adjust se_nsplit and qsplit
    !  to get the correct dt_dyn and dt_tracers
    !
    !  dt_dyn should be the same as with the v1 code
    !  dt_tracers should be up to 6x larger than dt_dyn
    !
    se_ne                 = 1024
    transport_alg         = 0  ! 12 for semi-lagrangian
    semi_lagrange_cdr_alg = 20 
    se_ftype              = 4 

    ! Set timesteps
    se_nsplit             = 30
    rsplit                = 1
    qsplit                = 1
    se_limiter_option     = 9  
    semi_lagrange_nearest_point_lev = 100 

    ! Set hyperviscosity
    hypervis_order        = 2
    hypervis_scaling      = 0
    hypervis_subcycle     = 2  ! Set to make dt_vis ~ 0.5s
    hypervis_subcycle_tom = 32 ! Set to make dt_vis ~ 0.5s
    hypervis_subcycle_q   = 1
    nu_div                = 2.5e10  !6.25e10
    nu                    = 2.5e10
    nu_p                  = 2.5e10
    nu_top                = 2.5e5
    nu_q                  = -1  ! 0 ! No hyperviscosity for semi-lagrangian tracers
    se_partmethod         = 4

    ! Use hydrostatic mode
    theta_hydrostatic_mode=.true. 
    tstep_type=5 
    theta_advect_form=1 

    ! Using Tempest maps, need element local projection from reference element space
    ! to the sphere
    cubed_sphere_map = 2

    ! Paths to new input data
    drydep_srf_file = '/project/projectdirs/acme/inputdata/atm/cam/chem/trop_mam/atmsrf_ne1024np4_20190621.nc'
    ncdata = '/global/cscratch1/sd/wlin/DYAMOND/inputdata/ifs_oper_T1279_2016080100_mod_subset_to_e3sm_ne1024np4_topoadj_L72.nc'

    ! Timestep output for debugging
    !nhtfrq = 0,1
    !mfilt  = 1,48
    !avgflag_pertape = 'A', 'I'
    !fincl2 = 'OMEGA500', 'TMQ', 'PRECT', 'PSL', 'TGCLDLWP'

    ! Write initial conditions more frequently
    inithist = 'DAILY'
EOF

cat <<EOF >> user_nl_clm
    finidat = '/global/cscratch1/sd/wlin/acme_scratch/cori-knl/ICRUCLM45-360x720cru/run/ICRUCLM45-360x720cru.clm2.r.2016-08-01-00000.nc'
EOF

./case.setup

./xmlchange --file env_run.xml TPROF_TOTAL=-1

echo "tprof_n = 1" >> user_nl_cpl
echo "tprof_option = 'nsteps'" >> user_nl_cpl

./case.build

./case.submit

@mt5555
Contributor

mt5555 commented Jan 28, 2020

Based on the stack trace, there might be a message explaining why the code called abort. Can you check near the end of the atm.log and e3sm.log files?

I suspect it's because the "theta-l" dycore has a new, more rational way to set the timesteps, so the new defaults interfere with the namelist above, which sets the old-style splitting parameters.

To keep using this old namelist, you might just need to add: "se_tstep=-1"

@dqwu
Contributor Author

dqwu commented Jan 28, 2020

@mt5555
You are correct: both atm.log and e3sm.log contain this error message:
ERROR: Only SL transport supports vertical remap time step < tracer time step.
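
The log check suggested above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `show_abort_messages` is hypothetical, the 200-line window is an arbitrary choice, and the `atm.log.*` / `e3sm.log.*` file name patterns are assumed from the log names mentioned in this thread rather than taken from E3SM or CIME documentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of E3SM or CIME): print ERROR lines found
# near the end of the atmosphere and driver logs in a run directory.
show_abort_messages() {
    run_dir="$1"
    for log in "$run_dir"/atm.log.* "$run_dir"/e3sm.log.*; do
        # Skip glob patterns that matched no file.
        [ -f "$log" ] || continue
        echo "== $log"
        # Only the tail is searched, since the abort message is expected
        # near the end of the log; "|| true" keeps a no-match from failing.
        tail -n 200 "$log" | grep "ERROR" || true
    done
}
```

It would be called with the case's run directory as the only argument, for example `show_abort_messages /path/to/case/run` (a placeholder path).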

@dqwu
Contributor Author

dqwu commented Jan 28, 2020

@mt5555
I will rerun the ne1024 F case on Cori with "se_tstep=-1" to see if that works.

@dqwu
Contributor Author

dqwu commented Jan 30, 2020

@mt5555
I have a new case run that sets se_tstep to -1 in user_nl_cam:

    ...
    ! Set timesteps
    se_nsplit             = 30
    rsplit                = 1
    qsplit                = 1
    se_limiter_option     = 9  
    semi_lagrange_nearest_point_lev = 100 
    se_tstep              = -1
    ...

However, the case still failed with exactly the same stack trace and error message.
Do you have other possible workarounds for me to test? Thanks.

@mt5555
Contributor

mt5555 commented Jan 31, 2020

Can you send me the atm.log file?

@dqwu
Contributor Author

dqwu commented Jan 31, 2020

@mt5555
/global/homes/d/dqwu/shared/atm.log.27733644.200130-075127

@mt5555
Contributor

mt5555 commented Jan 31, 2020

Based on the log file, can you try adding all of these:

se_tstep=-1
dt_remap_factor=-1
dt_tracer_factor=-1

@dqwu
Contributor Author

dqwu commented Feb 3, 2020

@mt5555
"se_tstep=-1 dt_remap_factor=-1 dt_tracer_factor=-1" works.
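
In the heredoc style of the reproduction steps, the working combination could be appended to user_nl_cam roughly as follows. This is a sketch, not a recommended default: the function name `append_timestep_overrides` is hypothetical, the three variable names and values are exactly those reported in this thread, and what -1 means for each variable is not spelled out here.

```shell
#!/bin/sh
# Hypothetical helper: append the three overrides reported to work in this
# thread to a CAM namelist file (user_nl_cam in the current directory by default).
append_timestep_overrides() {
    nl_file="${1:-user_nl_cam}"
    cat <<EOF >> "$nl_file"
    ! Workaround from this thread for the old-style splitting parameters
    se_tstep         = -1
    dt_remap_factor  = -1
    dt_tracer_factor = -1
EOF
}
```

Run from the case directory before ./case.submit, `append_timestep_overrides` with no argument appends to user_nl_cam; a different file can be passed as the first argument.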

@brhillman
Contributor

@dqwu this does not look like a bug; it looks like a problem with setting the timesteps in a still fairly custom configuration. We are actively working on dialing in the configuration for ne1024 before putting the defaults into the CIME configuration, so this is very much a work in progress. For now, I suggest using @mt5555's suggestion above, which seems to have worked for you. I am testing a more "out of the box" default right now, but I'd consider this issue closed, since the "error" appears to come from erroneous custom settings. The code does not appear to be doing anything unexpected.

dqwu removed the bug label Feb 4, 2020

@dqwu
Contributor Author

dqwu commented Feb 4, 2020

@brhillman
Many thanks for the information.

@mt5555
I think we can close this issue for now.

mt5555 closed this as completed Feb 5, 2020