[Bug]: No jobs completing successfully on zppy 2.4.0rc1 #561

Closed
forsyth2 opened this issue Mar 23, 2024 · 16 comments
Labels
semver: bug Bug fix (will increment patch version)

Comments

@forsyth2
Collaborator

What happened?

No jobs are completing successfully on zppy 2.4.0rc1. My best guess is that simulation input data was accidentally removed in the most recent LCRC scrubbing (March 6, targeting data 2 years or older).

After I got extra variables to plot for global time series (#400), I wanted to make sure the original plots still worked, so I made a cfg with the original simulation input and var list. I encountered a "Missing input files" error on the ts task. I initially thought it was related to the nc files appearing in quotes for some reason (see #400 (comment)). But then I looked at the MPAS Analysis error, which was FileNotFoundError: [Errno 2] No such file or directory: '/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201/run/mpaso_in'. That seemed odd, since I had changed nothing whatsoever in MPAS Analysis.

I then checked out a branch identical to main and ran the complete_run test. Nothing passed.
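
For reference, a quick check along these lines (paths taken from the cfg below; the grep pattern is just illustrative) shows whether the input directories were actually emptied:

# List a few of the atmosphere history files ncclimo expects
$ ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/archive/atm/hist/ | head
# Check for the MPAS namelists that MPAS-Analysis reads from the run directory
$ ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/run/ | grep -i mpas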

What machine were you running on?

Chrysalis

Environment

Dev environment: zppy_dev_20240322

What command did you run?

zppy -c tests/integration/generated/test_complete_run_chrysalis.cfg

Copy your cfg file

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test_main_20240322/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/test_main_20240322"

[climo]
active = True
walltime = "00:30:00"
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  vars = ""

  [[ atm_monthly_diurnal_8xdaily_180x360_aave ]]
  frequency = "diurnal_8xdaily"
  input_files = "eam.h4"
  input_subdir = "archive/atm/hist"
  vars = "PRECT"

  [[ land_monthly_climo ]]
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = ""

[ts]
active = True
e3sm_to_cmip_environment_commands = "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_to_cmip_20240322"
walltime = "00:30:00"
years = "1850:1854:2",

  [[ atm_monthly_180x360_aave ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  ts_fmt = "cmip"

  [[ atm_daily_180x360_aave ]]
  frequency = "daily"
  input_files = "eam.h1"
  input_subdir = "archive/atm/hist"
  vars = "PRECT"

  [[ atm_monthly_glb ]]
  # Note global average won't work for 3D variables.
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"
  years = "1850:1860:5",

  [[ land_monthly ]]
  e3sm_to_cmip_environment_commands = "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_to_cmip_20240322"
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "LAISHA,LAISUN"
  ts_fmt = "cmip"

  [[ land_monthly_glb ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"

  [[ rof_monthly ]]
  extra_vars = 'areatotal2'
  frequency = "monthly"
  input_files = "mosart.h0"
  input_subdir = "archive/rof/hist"
  mapping_file = ""
  vars = "RIVER_DISCHARGE_OVER_LAND_LIQ"

[tc_analysis]
active = True
scratch = "/lcrc/globalscratch/ac.forsyth2/"
walltime = "00:30:00"
years = "1850:1854:2",

[e3sm_diags]
active = True
grid = '180x360_aave'
ref_final_yr = 2014
ref_start_yr = 1985
# TODO: this directory is missing OMI-MLS
sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere", "tc_analysis",
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
walltime = "00:30:00"
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave ]]
  climo_diurnal_frequency = "diurnal_8xdaily"
  climo_diurnal_subsection = "atm_monthly_diurnal_8xdaily_180x360_aave"
  partition = "compute"
  qos = "regular"
  sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere",
  walltime = "5:00:00"

  [[ atm_monthly_180x360_aave_environment_commands ]]
  environment_commands = "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20240322"
  sets = "qbo",
  ts_subsection = "atm_monthly_180x360_aave"

  [[ atm_monthly_180x360_aave_tc_analysis ]]
  # Running as its own subtask because tc_analysis requires jobs to run sequentially, which slows down testing
  sets = "tc_analysis",
  years = "1850:1852:2",

  [[ atm_monthly_180x360_aave_mvm ]]
  # Test model-vs-model using the same files as the reference
  climo_diurnal_frequency = "diurnal_8xdaily"
  climo_diurnal_subsection = "atm_monthly_diurnal_8xdaily_180x360_aave"
  climo_subsection = "atm_monthly_180x360_aave"
  diff_title = "Difference"
  partition = "compute"
  qos = "regular"
  ref_final_yr = 1851
  ref_name = "v2.LR.historical_0201"
  ref_start_yr = 1850
  ref_years = "1850-1851",
  reference_data_path = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test_main_20240322/v2.LR.historical_0201/post/atm/180x360_aave/clim"
  run_type = "model_vs_model"
  short_ref_name = "v2.LR.historical_0201"
  swap_test_ref = False
  tag = "model_vs_model"
  ts_num_years_ref = 2
  ts_subsection = "atm_monthly_180x360_aave"
  walltime = "5:00:00"
  years = "1852-1853",

  [[ lnd_monthly_mvm_lnd ]]
  # Test model-vs-model using the same files as the reference
  climo_subsection = "land_monthly_climo"
  diff_title = "Difference"
  #grid = 'native'
  partition = "compute"
  qos = "regular"
  ref_final_yr = 1851
  ref_name = "v2.LR.historical_0201"
  ref_start_yr = 1850
  ref_years = "1850-1851",
  reference_data_path = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test_main_20240322/v2.LR.historical_0201/post/lnd/180x360_aave/clim"
  run_type = "model_vs_model"
  sets = "lat_lon_land",
  short_ref_name = "same simulation"
  swap_test_ref = False
  tag = "model_vs_model"
  ts_num_years_ref = 2

[mpas_analysis]
active = True
anomalyRefYear = 1850
climo_years ="1850-1854", "1855-1860",
enso_years = "1850-1854", "1855-1860",
mesh = "EC30to60E2r2"
parallelTaskCount = 6
partition = "compute"
qos = "regular"
ts_years = "1850-1854", "1850-1860",
walltime = "00:30:00"

[global_time_series]
active = True
climo_years ="1850-1854", "1855-1860",
experiment_name = "v2.LR.historical_0201"
figstr = "v2_historical_0201"
moc_file=mocTimeSeries_1850-1860.nc
ts_num_years = 5
ts_years = "1850-1854", "1850-1860",
walltime = "00:30:00"
years = "1850-1860",

[ilamb]
active = True
grids = '180x360_aave'
nodes = 8
partition = "compute"
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1854:2",

  [[ land_monthly ]]

What jobs are failing?

climo_atm_monthly_180x360_aave_1850-1851.status:ERROR (3)
climo_atm_monthly_180x360_aave_1850-1853.status:ERROR (3)
climo_atm_monthly_180x360_aave_1852-1853.status:ERROR (3)
climo_atm_monthly_diurnal_8xdaily_180x360_aave_1850-1851.status:ERROR (1)
climo_atm_monthly_diurnal_8xdaily_180x360_aave_1850-1853.status:ERROR (1)
climo_atm_monthly_diurnal_8xdaily_180x360_aave_1852-1853.status:ERROR (1)
climo_land_monthly_climo_1850-1851.status:ERROR (3)
climo_land_monthly_climo_1850-1853.status:ERROR (3)
climo_land_monthly_climo_1852-1853.status:ERROR (3)
e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status:WAITING 489952
e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.status:WAITING 489954
e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.status:WAITING 489953
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.status:WAITING 489949
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.status:WAITING 489951
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.status:WAITING 489950
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.status:WAITING 489956
e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.status:WAITING 489955
e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.status:WAITING 489957
global_time_series_1850-1860.status:WAITING 489960
ilamb_1850-1851.status:WAITING 489961
ilamb_1852-1853.status:WAITING 489962
mpas_analysis_ts_1850-1854_climo_1850-1854.status:ERROR (2)
mpas_analysis_ts_1850-1860_climo_1855-1860.status:WAITING 489959
tc_analysis_1850-1851.status:RUNNING 489947
tc_analysis_1852-1853.status:WAITING 489948
ts_atm_daily_180x360_aave_1850-1851-0002.status:ERROR (1)
ts_atm_daily_180x360_aave_1852-1853-0002.status:ERROR (1)
ts_atm_monthly_180x360_aave_1850-1851-0002.status:ERROR (1)
ts_atm_monthly_180x360_aave_1852-1853-0002.status:ERROR (1)
ts_atm_monthly_glb_1850-1854-0005.status:ERROR (1)
ts_atm_monthly_glb_1855-1859-0005.status:ERROR (1)
ts_land_monthly_1850-1851-0002.status:ERROR (1)
ts_land_monthly_1852-1853-0002.status:ERROR (1)
ts_land_monthly_glb_1850-1851-0002.status:ERROR (1)
ts_land_monthly_glb_1852-1853-0002.status:ERROR (1)
ts_rof_monthly_1850-1851-0002.status:ERROR (1)
ts_rof_monthly_1852-1853-0002.status:ERROR (1)

What stack trace are you encountering?

No response

forsyth2 added the semver: bug label Mar 23, 2024
@forsyth2
Collaborator Author

Example errors:

climo_atm_monthly_180x360_aave, climo_land_monthly_climo ERROR (3):

$ cat climo_atm_monthly_180x360_aave_1850-1851.o489926 
...
ncclimo: ERROR Unable to find required input file /lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201/archive/atm/hist/v2.LR.historical_0201.eam.h0.1850-01.nc
ncclimo: HINT All files implied to exist by the climatology bounds (start/end year/month) and by the specified (with -P or -m) or default model type, must be in /lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201/archive/atm/hist before ncclimo will proceed

climo_atm_monthly_diurnal_8xdaily ERROR (1):

$ cat climo_atm_monthly_diurnal_8xdaily_180x360_aave_1850-1851.o489929
Missing input files

mpas_analysis ERROR (2) (there were actually many, many stack traces in the output):

$ cat mpas_analysis_ts_1850-1854_climo_1850-1854.o489958
...
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: '/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201/run/mpassi_in'
Warning: mpasTimeSeriesSeaIce failed during check and will not be run
Warning: prerequisite of timeSeriesSeaIceAreaVol failed during check, so this task will not be run

ts_atm_daily, ts_atm_monthly_glb, ts_land_monthly_glb ERROR (1):

$ cat ts_atm_daily_180x360_aave_1850-1851-0002.o489937 
ts_only
Missing input files

ts_atm_monthly, ts_land_monthly, ts_rof_monthly ERROR (1):

$ cat ts_atm_monthly_180x360_aave_1850-1851-0002.o489935 
cmip
Missing input files

@forsyth2
Collaborator Author

$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
$ du -sh .
17M	.
$ 

I'm essentially certain the size was in gigabytes before... I think the data did indeed get scrubbed.

@forsyth2
Collaborator Author

@xylar @chengzhuzhang This needs to be resolved before we can release zppy v2.4.0. In theory, things are still working fine, but we need valid test input data to know that for sure...

@chengzhuzhang I suppose this presents a good opportunity to convert the tests over to testing on v3 data rather than the older v2 simulation data. That said, it would probably be quicker to try to restore the v2 data to Chrysalis by transferring from the HPSS archive.
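
If the run was archived with zstash, the restore could look roughly like the following; this is only a sketch, and <HPSS_ARCHIVE_PATH> is a placeholder for wherever this simulation was actually archived:

# Run from the directory the data should be restored into
$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# Pull back only the history directories and MPAS namelists the tests need
$ zstash extract --hpss=<HPSS_ARCHIVE_PATH> 'archive/*/hist/*' 'run/mpas*'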

@xylar
Contributor

xylar commented Mar 23, 2024

@forsyth2, it is not going to work to store testing data in your own scratch space. I store very small test runs under diagnostics/mpas_analysis and they get synced to other machines. Many GB is likely too much, though. We would need to arrange a space with @rljacob that doesn't get scrubbed but also doesn't get synced for larger testing data.

@forsyth2
Collaborator Author

@xylar Oh, I see. Well, actually, syncing would be quite useful, because we want to run the tests on Chrysalis, Compy, and Perlmutter.

@xylar
Contributor

xylar commented Mar 25, 2024

@forsyth2, how much data are we talking about?

@chengzhuzhang
Collaborator

@forsyth2, how much data are we talking about?

I had the same question. Right now, we can just use Globus to transfer over one copy from NERSC disk. In the future, we may need to keep the testing data somewhere that is exempted from scrubbing.
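
For the one-off copy, something like this with the Globus CLI should do it; the endpoint UUIDs are placeholders, and the source path assumes the same layout as on Perlmutter:

# NERSC_DTN_UUID and LCRC_DTN_UUID are placeholder Globus endpoint IDs
$ globus transfer --recursive --label "v2.LR.historical_0201 test data" \
    NERSC_DTN_UUID:/global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201 \
    LCRC_DTN_UUID:/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201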

@forsyth2
Collaborator Author

@xylar So the data is also on Perlmutter for testing there:

$ cd /global/cfs/cdirs/e3sm/forsyth/E3SMv2
$ du -sh v2.LR.historical_0201/
24T	v2.LR.historical_0201/

So, 24 terabytes.

@forsyth2
Collaborator Author

So, not in the gigabyte range, but certainly more than the 17M left post-scrubbing on LCRC.

@xylar
Contributor

xylar commented Mar 26, 2024

That is about 20 times the size of the diagnostics folder, so unfortunately that's not acceptable.

I think you both will need to come up with another space to use and figure out how to handle scrubbing.

@xylar
Contributor

xylar commented Mar 26, 2024

@forsyth2, can you find a smaller dataset to use for testing, or use a smaller subset of v2.LR.historical_0201? I would be okay with adding something on the order of 100-200 GB to the diagnostics directory if that's feasible.

@chengzhuzhang
Collaborator

@xylar I think it is less critical to host this testing data in the diagnostics directory, because unlike diagnostics data and other testing data, this dataset won't change over time, so we don't really need functionality like machine syncing. In this case, making sure the directory hosting v2.LR.historical_0201 is exempted from scrubbing would be good. And reducing the data size is a good point.

@forsyth2
Collaborator Author

I would be okay with adding something on the order of 100-200 GB

Oh, I got the size ordering (terabyte > gigabyte) mixed up. Yes, a terabyte is quite large, I see.

reducing the data size is a good point.

Yes, I suppose this is feasible, since we only test on ~10 years of data. It just seemed complicated to try to delete the data associated with the remainder of the time period. I figured we had the space, so why not just have it all? But I see now that we do not, in fact, have the space.
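
For when we get to that, since the cfg only ever references 1850-1860, a filtered copy along these lines could carve out just the tested years; the filename patterns assume the usual <case>.<component>.<stream>.YYYY-MM*.nc history naming, and the destination directory name is just an example:

# Copy only history files whose names contain years 1850-1860, preserving the directory layout
$ rsync -av --prune-empty-dirs \
    --include='*/' \
    --include='*.185[0-9]-*' --include='*.1860-*' \
    --exclude='*' \
    /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201/ \
    /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201_1850-1860/

Non-history files the tasks read (e.g. the run/mpas*_in namelists) would need to be included separately.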

@xylar
Contributor

xylar commented Mar 26, 2024

24 TB is quite a lot of space, so, no, I don't think we have that much to spare in general.

@forsyth2
Collaborator Author

Ok, I'll get through testing of this RC and then work on reducing the test data size.

forsyth2 mentioned this issue Mar 26, 2024
@forsyth2
Collaborator Author

Ok, I got the data transferred back to Chrysalis (6 hours on Globus) and ran the tests successfully on main, so I think the latest RC of zppy is fine. I'm going to close this issue.

I'll plan to reduce the testing data size when we do v3 (#552). We shouldn't be testing on v2 for too much longer anyway.
