Automatic resubmit does not work on several platforms #373

Closed
jeff-cohere opened this issue Oct 19, 2015 · 21 comments

@jeff-cohere
Contributor

(via Peter Caldwell)

The following link defines a case that runs successfully but fails to restart on Edison:

https://gist.github.com/jnjohnsonlbl/dfc2e5828d02f859d4cf

@rljacob
Member

rljacob commented Oct 26, 2015

Looks like this is basically an FC5 case with ne30_g16 resolution. Is anything else needed to cause the restart failure?

@rljacob
Member

rljacob commented Oct 26, 2015

An exact restart test with FC5 and ne30_g16 passes on Edison with the latest master. By default, it does an 11-day run, writes a restart at day 6, then picks that up and does a 5-day run, then compares the results.
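
The logic of such an exact restart test can be sketched with a toy model. This is only an illustration, not CIME's test code: `step`, `run`, and `exact_restart_check` are hypothetical names, and the integer update rule is an assumed stand-in for the real model.

```python
def step(state: int) -> int:
    """Advance the toy model by one 'day' (a deterministic integer update)."""
    return (state * 1103515245 + 12345) % 2**31


def run(state: int, ndays: int) -> int:
    """Run the toy model for ndays, starting from the given state."""
    for _ in range(ndays):
        state = step(state)
    return state


def exact_restart_check(initial: int, total_days: int = 11, restart_day: int = 6) -> bool:
    """Return True if a restarted run reproduces the uninterrupted run exactly."""
    reference = run(initial, total_days)                     # full 11-day run
    restart_state = run(initial, restart_day)                # state saved at day 6
    resumed = run(restart_state, total_days - restart_day)   # 5-day continuation
    return resumed == reference


print(exact_restart_check(2015))
```

Note that a test of this kind exercises the restart files only. It says nothing about whether the batch job resubmits itself, which is a separate step.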

@rljacob
Member

rljacob commented Oct 26, 2015

Tagging @PeterCaldwell

@mt5555
Contributor

mt5555 commented Dec 2, 2015

Is this the issue we are using to track the failure to resubmit on all platforms? (The original bug report says the case fails to restart, but @PeterCaldwell confirmed that he meant the job fails to resubmit a restart run.)

@rljacob rljacob changed the title Attached case fails to restart on Edison ACME resubmit does not work on several platforms Dec 2, 2015
@rljacob rljacob changed the title ACME resubmit does not work on several platforms Automatic resubmit does not work on several platforms Dec 2, 2015
@rljacob
Member

rljacob commented Dec 2, 2015

Yes. Updated the title.

@abigailgaddis

Not sure if this is helpful, but my resubmits are working on Titan in the usual way (continue_run = true in env_run.xml, along with setting the number of days and resubmits). I'm using a slightly older version of ACME that still has the cime/scripts structure. Git tag: v0.4-298-ga47f412

@rljacob
Member

rljacob commented Dec 3, 2015

Thanks. The machines known to be affected so far are Edison, Mira, and Constance (PNNL). A fix is available; I just need time to merge it.

@kaizhangpnl
Contributor

Thanks for the information. Would you mind sharing the fix (in a feature branch?) with us (PNNL team) now?

@rljacob
Member

rljacob commented Dec 3, 2015

I misspoke: I don't have a branch either. It's a patch for standalone CIME that I still need to put on an ACME feature branch. I'll try to at least get it pushed tomorrow.

@kaizhangpnl
Contributor

Great. Thanks!

@PeterCaldwell
Contributor

Is this fixed on the latest master now? Previously, the first time you submitted a job it failed to resubmit, but if you manually resubmitted the model once, it would automatically resubmit itself from then on. @golaz mentioned yesterday that this week's tags not only fail to resubmit initially, but also fail to resubmit themselves subsequently on Titan. I'm wondering if this is a symptom of your cure...

@mt5555
Contributor

mt5555 commented Dec 10, 2015

Probably not.

Ideally, if this is fixed, the pull request would say "Fixes #XXXX", and then this issue would be closed automatically when the pull request is merged.

@golaz
Contributor

golaz commented Dec 10, 2015

Yes, I was getting errors with the automatic resubmission on Titan last week. The error was:

"""
Submitting CESM job script: case_scripts.run
: qsub ./case_scripts.run

Job not submitted

qsub input from STDIN is not permitted.
Please use "qsub batch.script".

Job submission failed
"""

Eventually, I ended up disabling automatic resubmission and manually submitting chained jobs with dependencies. That's actually probably more efficient in terms of getting through the queues.
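
Chaining jobs by hand, as described above, might look like the following sketch. It assumes a PBS/Torque-style `qsub` that prints the job ID on standard output and accepts `-W depend=afterok:<jobid>`. The helper names (`build_qsub_command`, `run_qsub`, `submit_chain`) and the injectable `submit` callable are hypothetical, and site-specific wrappers may behave differently.

```python
import subprocess
from typing import Callable, List, Optional


def build_qsub_command(script: str, after_job: Optional[str] = None) -> List[str]:
    """Build a qsub command, optionally held until after_job ends successfully."""
    cmd = ["qsub"]
    if after_job is not None:
        cmd += ["-W", f"depend=afterok:{after_job}"]
    cmd.append(script)
    return cmd


def run_qsub(cmd: List[str]) -> str:
    """Run qsub and return the job ID it prints (assumed to be on stdout)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_chain(script: str, n_jobs: int,
                 submit: Callable[[List[str]], str] = run_qsub) -> List[str]:
    """Submit n_jobs copies of script, each depending on the previous one."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for _ in range(n_jobs):
        previous = submit(build_qsub_command(script, previous))
        job_ids.append(previous)
    return job_ids
```

For example, `submit_chain("./case_scripts.run", 4)` would queue one run plus three dependent runs. Passing a fake `submit` function allows a dry run on a machine without a scheduler.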

@golaz golaz closed this as completed Dec 10, 2015
@golaz golaz reopened this Dec 10, 2015
@abigailgaddis

When I submit on Titan, I've been using ./CASENAME.submit rather than the old qsub *.run. I'm not sure if that's the issue, @golaz?

@cameronsmith1
Contributor

I get the same message as @golaz.
The run is started using ./CASENAME.submit. The error @golaz provides above occurs after the first run has finished, when it tries to submit a follow-on run.

@rljacob rljacob self-assigned this Jan 7, 2016
@ghost

ghost commented Jan 19, 2016

Resubmission works for me on Edison at NERSC with @amametjanov's commit:

c4501ed
Author: Azamat Mametjanov azamat@mcs.anl.gov
Date: Fri Jan 8 19:23:45 2016 -0800

The way it works is that after a run and, possibly, archiving are done, it sends another run to the queue.
Ideally, you want to specify the number of runs and submit them all with dependencies, so that when the first is done, the dependents are ready to go without having to climb the queue again.
Note that each queue limits the number of jobs that can be running or held at the same time.

@cameronsmith1
Contributor

I have found that the problem goes away if short-term archiving is switched on (dout_s=true in env_run.xml). I have no idea why this should be, other than the logic in the script is probably a bit different.

@bmayerornl

I am having a failure to resubmit on Titan. I set up a case using v1.0.0-alpha.5. The case creation command is: create_newcase -mach titan -compiler pgi -res ne16_g37 -compset F1850C5 -project cli115 -case F1850C5_ne16_g37_p32

After running a 2-month test case, I modified env_run.xml from the defaults as follows:
RESUBMIT = 6
CONTINUE_RUN = TRUE
And of course the stop option changed from ndays to nmonths, and STOP_N set to 8

The case runs fine, but does not resubmit itself. When I manually resubmit, the case runs fine again, but again does not resubmit itself.
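
For reference, the planned length of such a chain follows from these settings. This is a minimal sketch, assuming the usual convention that RESUBMIT counts the additional submissions after the first one, so the case runs RESUBMIT + 1 segments of STOP_N units each. The function name is hypothetical.

```python
def planned_segments(resubmit: int, stop_n: int) -> tuple:
    """Return (number of run segments, total simulated length in STOP_N units)."""
    if resubmit < 0 or stop_n <= 0:
        raise ValueError("resubmit must be >= 0 and stop_n must be > 0")
    segments = resubmit + 1          # the initial submission plus each resubmit
    return segments, segments * stop_n


# Settings from the report above: RESUBMIT = 6, STOP_N = 8 (months)
print(planned_segments(6, 8))  # (7, 56)
```

With working automatic resubmission, these settings would be expected to give seven 8-month segments. In the failure described here, only one segment runs per manual submission.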

@cameronsmith1
Contributor

@bmayerornl, do you have short-term archiving turned on? That has solved the problem for me in the past (dout_s=true in env_run.xml).

@helenhe40

@bmayerornl For the resubmit issue, could you please test the following?

In my tests, adding the "cd $CASEROOT" line below to your *.run batch script, before the if ($RESUBMIT > 0) logic, makes the resubmit work.

cd $CASEROOT    # this is the added line

if ($RESUBMIT > 0) then
    @ RESUBMIT = $RESUBMIT - 1
    echo RESUBMIT is now $RESUBMIT
    ...

Could you please give it a try?
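
One plausible reason the added `cd $CASEROOT` helps (an assumption, not confirmed in this thread) is that the resubmit step refers to the run script by a relative path, as in the `qsub ./case_scripts.run` line in the error output earlier. If the job's working directory has moved away from the case directory by then, that relative path no longer resolves. The sketch below reproduces this effect with temporary directories standing in for the case root and the run directory; all names in it are hypothetical.

```python
import os
import tempfile


def script_visible_from(workdir: str, relative_script: str) -> bool:
    """Return True if relative_script resolves to a file when cwd is workdir."""
    original = os.getcwd()
    try:
        os.chdir(workdir)
        return os.path.isfile(relative_script)
    finally:
        os.chdir(original)   # always restore the caller's working directory


with tempfile.TemporaryDirectory() as caseroot, tempfile.TemporaryDirectory() as rundir:
    # Create a stand-in run script inside the "case root" only.
    with open(os.path.join(caseroot, "case.run"), "w") as handle:
        handle.write("#!/bin/csh\n")

    found_in_rundir = script_visible_from(rundir, "./case.run")      # cwd left in the run dir
    found_in_caseroot = script_visible_from(caseroot, "./case.run")  # after "cd $CASEROOT"
    print(found_in_rundir, found_in_caseroot)
```

The script is found only when the working directory is the case root, which is the condition the added line restores before the resubmit logic runs.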

@rljacob
Member

rljacob commented Jul 12, 2016

The new version of CIME has a test for resubmit.

Clone it from https://github.com/ESMCI/cime/, cd to cime/scripts and try:
./create_test ERR_Ld3.f45_g37_rx1.A

rljacob added a commit that referenced this issue Aug 11, 2016
12d2135 Merge pull request #388 from ESMCI/jgfouca/need_to_report_build_exceptions
8f677cd Add test to ensure build fails report info to teststatus.log
7095ef0 Need to report build exception contents
7c9cc94 Merge pull request #387 from ESMCI/jgfouca/fix_case_build_return_code
bf941ed case.build needs to check success in order to return a sane error code
df432e8 update ChangeLog
154d5f8 Merge pull request #378 from ESMCI/rljacob/update-config-files
6b8fc76 Merge pull request #382 from ESMCI/sarich/fix-taskmaker-counter
5df46a2 Merge pull request #381 from jedwards4b/test_fixes
e58d624 component_compare_test was not properly reporting failures
0e0e577 Update acme config_files for mpas
84122dd Merge pull request #376 from ESMCI/jgfouca/changes_from_acme
38e2f8a More stuff from ACME
c86398e Merge pull request #375 from ESMCI/jgfouca/portable_run_cmd_utest
daaf621 Merge pull request #373 from ESMCI/jgfouca/enhance_bisect
e61ba96 Change MPAS compset for test
c77a64e Add homme python test
3031f00 Better support for 'none' module system
3ceef7b Make run_cmd_no_fail unit test more portable
bbd20fb Merge branch 'jgfouca/fix_module_list' (PR #374)
6eb8143 Reactivate creation of software_environment.txt
4dd30c6 Ensure module setup is sourced before list
c8bd20e fix bug in translation from perl
20a3412 cime_bisect: Add better support for modifying create_test run
77871de update changelog
e74906b comment out code until it works for tcsh users
4a13413 fix issue with module list
6ad4b2d update changelog
cef688d update changelog
835b511 Merge pull request #367 from jedwards4b/user_mod_0len_fix
b280b55 Merge pull request #362 from ESMCI/jgfouca/remove_perl_taskmaker
12a30ee Merge pull request #356 from ESMCI/jgfouca/minor_timing_chg
3fb0b80 Merge pull request #355 from ESMCI/jgfouca/wait_for_test_refactor
d083933 user_nl_ file was being removed if a user_nl file in any mods directory was missing
39de940 Fix comment
5d38420 Revert "Merge pull request #343 from ESMCI/wilke/scripts/xmlchange"
8dc2354 Merge pull request #363 from ESMCI/rljacob/machines/fix-acme
0773aac Increase default walltime for blues
c4dce0f Remove last uses of taskmaker.pl
76eb1bc Remove -A directive from edison
d67b267 Merge pull request #361 from billsacks/cism_nag
bf02e3e Merge pull request #357 from ekluzek/fixpionml
c9b8910 Seperate out modelio namelist definition since it uses the same names, but defines them differently
805ad7f Add -mismatch_all when compiling cism with nag
fff9a9b Set CHECK_TIMING to true in addition to SAVE_TIMING if --save-timing given to create_test
f25a518 wait_for_tests will now always specifically wait for the RUN phase
e19e72c Update drv buildnamelist test to work with cime5
db1538e Merge branch 'douglasjacobsen/add_lanl_machines' (PR #353)
893c6c6 Add support for LANL's mustang and wolf to cime
dea8a3a Merge pull request #350 from ESMCI/rerun_test_functionality
9b4488b Changes based on github feedback
c103e08 Add SAVE_TIMING_DIR for edison
98f95bb Merge pull request #341 from ESMCI/santos/fix-env-leakage
1b775e4 Fixes post-upstream-merge
ea97b56 Merge branch 'master' into rerun_test_functionality
1ca1b83 Merge pull request #348 from ESMCI/jayeshkrishna/machinefiles/get_acme_cime_dev_working_on_mira
252aea7 Merge pull request #343 from ESMCI/wilke/scripts/xmlchange
2db894f Add missing files
622b7d0 Complete
ba69385 progress
84000ee Fixing the runjob command for ACME on Mira
1a55232 Adding config for ERS_Ld3.ne30_g16_rx1.A test
d1df346 Error handling; check for correct length of key-value pair array after split
fa1cb49 listofsettings allways an array, test for length of array
a6a3d33 Changed number of expected positional arguments to 0 or 1 , warnings and debug statements
74d2100 checking for missing values in settings string from command line
cd350da Remove `GenericXML` check for env variables.
6ca6b59 progress
e7b334e Merge pull request #340 from ESMCI/douglasjacobsen/fix_test_template
43807a5 Add white space after batch directives in script templates
e061505 Merge pull request #337 from ESMCI/jgfouca/autosave_env_info
edc1671 Autosave environment information in case_setup.
33ce89b Merge pull request #336 from ESMCI/jgfouca/fix_create_test_not_catching_missing_project
c3f8f84 create_test was not failing the create_newcase phase when project info was missing
fef81df Merge pull request #335 from ESMCI/jgfouca/add_queue_option_to_create_test
731f8a0 Add ability to select queue to create_test and create_newcase
6f9613f Merge pull request #333 from ESMCI/jgfouca/even_more_sky_env_fixes
11877a2 Fix mismatch between MPI_PATH and the mpi module being loaded
d37e177 Merge pull request #322 from ESMCI/jgfouca/restore_good_python_version_error
ad18c34 Merge pull request #331 from ESMCI/jgfouca/reduce_output_from_check_input_data
0268cb7 Only report present files in debug mode
47b4216 Fix spelling mistake
dddd7f0 Merge pull request #327 from ESMCI/jgfouca/fix_more_sky_env_issues
07aeb52 Get cime_developer building again on skybridge.
0aceb94 Merge branch 'wilke/template/directives' (PR #324)
d8f331c moved batchdirectives to top of the template
dbbd3a1 Users should get a nice error when their python is too old
5eea798 Merge pull request #320 from ESMCI/jgfouca/fix_skybridge_env_issues
ca3e004 Fix skybridge environment problems, port to new SEMS modules
1a82a57 Merge pull request #318 from ESMCI/jgfouca/remove_sentinel_concept
e266dd9 Remove sentinel concept from jenkins_generic_job
aae8e30 comments for cime5.0.5
e20f807 Merge pull request #317 from billsacks/restore_lii
1967bb3 Restore LII test
c404dfa Merge remote-tracking branch 'origin/master'
0406b7d updates for cime5.0.4
7930cd7 Merge pull request #316 from ESMCI/jgfouca/update_code_checker
009f7a1 code_checker: Leverage .gitignore by using git ls-files instead of find to get list of ifles to check
c5555c7 Merge pull request #315 from Katetc/master
c4817d7 fix issue 314
7a493a0 remove multiple run lines from test file
c269ab9 Merge pull request #313 from ESMCI/jgfouca/correctly_report_problem_in_test
734e4b4 Rolling the Intel compiler back to v15.0.2
9a46a99 On batch systems, be sure to report that the problem is with wait_for_tests, not create_test
0e95d19 add some more info to README file
e9586c1 Changes required to support the new Hobart cluster configuration
cdb7805 update documentation for --xml options
98a0380 Merge pull request #307 from ESMCI/rljacob/tests/add-readme
46ff3c6 work on test rerunability
c5f2ae9 Merge pull request #308 from ESMCI/jgfouca/fix_check_input_bug
bddea72 document ERR test
8639ba1 create ability to run tests in same case more than once
b18580b Fix minor erroneous output bug in check_input_data
6f3e448 Add README back to Testcases
fd001a7 Add README back to SystemTests
929f05a Merge pull request #305 from ESMCI/jgfouca/advanced_profiling_tool
20c2e22 Make prof tools a bit more user friendly
1f9b49b Merge pull request #304 from jedwards4b/dynamic_system_test_dirs
0ba7f50 handle so that we dont have a list of test names to maintain
92a266e handle so that we dont have a list of test names to maintain
99f3d35 New tool 'advanced-py-prof'
4e6f7f5 initialize contents
3703d94 machine specific fixes for edison/cori/slurm systems
f6dc40a fixes git issue 303
7bacf48 repeat change for acme
503d5ad load system test directories dynamically based on paths in config_files.xml
b28ddff Merge pull request #302 from ESMCI/jgfouca/profiling_tool_etc
0e1553f Add a new tool for very simple python profiling
1eb2a56 Merge pull request #301 from jedwards4b/shell_commands_delete
2f78ec7 remove any existing shell_commands files from case before writing new ones
5f85c05 update changelog for tag
4ef656d pylint fixes
8624ece fix needed for scripts_regression_tests following PR298
6cc9110 These were supposed to be in PR296
d19afa9 Merge remote-tracking branch 'jedwards/testing_fixes' (PR #296)
b302243 Merge pull request #298 from ESMCI/jgfouca/restore_verbose
0a95b04 component_compare_test should fail if one of the components to be compared is not found"
9aa8683 Reintroduce verbose option into the refactored logging system
a796384 fix for lii test and response to review
d8c9b2e Merge pull request #286 from jedwards4b/buildnml_output_fix
1894e28 improve documetation of debug option, remove incorrect documentation of verbose option
21ef9ae remove whitespace surrounding test names
f6fdbb7 fix erp test update ChangeLog
b8b4723 Fix LII test
daa0c63 rename bisect unit test from acme to cime
4f1a079 move pecount code from create_test to create_newcase
24eb48f move clm include directory to prevent build confusion
1ee4449 add support for ascii testfile, allow multiple compilers in tests
d6c28b4 fix memleak test giving error if baseline not found
8ff5f19 Merge pull request #287 from ESMCI/nag_mismatch
a57e410 Remove `-mismatch_all` from NAG options in CESM.
5040500 output from buildnml scripts now prints

git-subtree-dir: cime
git-subtree-split: 12d2135
rljacob added a commit that referenced this issue Dec 12, 2016
Upgrade ACME's version of CIME from v2 to v5.1.

Introduce new python versions of all user-interface functions.
Introduce new xml-format for most case and config xml files.
Nearly all internal code has been converted to python.

Fixes #373
Fixes #967

[BFB] except for IG cases because of changes to vertical remapping.

Conflicts resolved by removing the old mpas-cice and mpas-o
buildnaml and commit beta0 version with necessary cime5 changes.
Removed old config_grid.xml
Also converted AVIC-L to use beta0 settings in cam
config_component.xml
rljacob added a commit that referenced this issue Feb 27, 2017
jgfouca pushed a commit that referenced this issue Feb 27, 2018
jgfouca pushed a commit that referenced this issue Mar 14, 2018
rljacob added a commit that referenced this issue Apr 16, 2021
rljacob added a commit that referenced this issue May 6, 2021
yunpengshan2014 pushed a commit that referenced this issue Apr 2, 2024
* start arm diags

* start arm diags

* annual cycle and diurnal cycle done

* viewer first working

* derived var for ARM obs

* hard coded viewer

* refine conv onset/read lat lon from data

* Add to derived variables

* finalize plots and viewer

* model vs model also works

* add postprocessing example

* add unit tests

* add unit test

* fix all_sets.cfg

* address review comments

* all tests pass