Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert ERP test to SystemTestsCompareTwo infrastructure? #448

Closed
billsacks opened this issue Aug 24, 2016 · 26 comments
Closed

Convert ERP test to SystemTestsCompareTwo infrastructure? #448

billsacks opened this issue Aug 24, 2016 · 26 comments
Assignees

Comments

@billsacks
Copy link
Member

We want to convert the various system tests that compare two runs to use the SystemTestsCompareTwo infrastructure, where possible. I am opening separate issues for each such system test, so that we can check them off as we do this conversion.

The ERP test would be less straightforward to convert to this infrastructure, since currently it makes use of setting CONTINUE_RUN which doesn't work if we're using a clone-based test. On the other hand, the criticality and complexity of the ERP test argue for translating it to this new infrastructure, which will make it more maintainable and less error-prone.

@mvertens suggests changing its behavior so that, rather than doing a CONTINUE_RUN, it instead uses a branch run for the second case. For this to work, we'd need to introduce a new hook into the SystemTestsCompareTwo infrastructure: a _pre_run_two hook that is executed between the case1 run and the case2 run. For the ERP test, this would need to copy (or link) the relevant restart files from the case1 run directory to the case2 run directory.

@mfdeakin-sandia
Copy link
Contributor

mfdeakin-sandia commented Nov 16, 2016

What are the relevant restart files, or how can I find out their names? ERP.run_phase copies some XML files to the LockedFiles directory and what looks like the CESM/ACME executable, but doesn't copy anything which looks like a restart file. It does call case_setup again, but that doesn't seem related either

@jedwards4b
Copy link
Contributor

I suggest that we run the st_archive on case1 and retrieve them from the archive directory for case 2. The st_archiver already knows what files are required.

@mfdeakin-sandia
Copy link
Contributor

Alright, I'll look into how I can do that. I assume it's not going to be as simple as the hacky run_cmd_no_fail("case.st_archive")?

@jedwards4b
Copy link
Contributor

not as hacky as that just need to set st_archive argument in run_indv(self, suffix="base", st_archive=False) to True. You would do this for the first run and then retreve all of the files from the archive restart directory into case2 run directory.

@mfdeakin-sandia
Copy link
Contributor

Hmm. I was hoping to be able to do this all in the _pre_run_hook method @billsacks and @mvertens mentioned. I guess adding more default constructor parameters for SystemTestsCompareTwo will work for this, though.

@jedwards4b
Copy link
Contributor

system_tests_compare_two.py isn't calling run_indv with all the available arguments, i think that you just need to add that st_archive argument there to make it available to tests using that method.

@billsacks
Copy link
Member Author

I'm a bit hesitant about using the short-term archiver here: The ERP test already does so much that it can be challenging to figure out what went wrong when it fails. Adding the short-term archiver into the mix adds an additional potential failure mode. Or multiple new failure modes if this is done via an additional job... I can't tell if that's what you're suggesting here, but I would at least vote against having this be done via multiple batch jobs if possible.

On the other hand, I see @jedwards4b 's point that that might be the easiest way to be sure you're picking up all the necessary restart files.

The one other idea that comes to mind is refactoring the short-term archiver to extract a clean function that gives you the list of all restart files. It looks like this function would contain some of what's currently in case_st_archive.py: _archive_restarts. Then this function could be called by the short-term archiver. In addition, it could be called directly by the ERP test to get the list of restart files that should be copied (or linked) into the second case.

All that said: I guess I'm okay with sticking with the idea of using the short-term archiver in the ERP test, as long as it doesn't require submitting a separate job.

@jedwards4b
Copy link
Contributor

I would override the run_phase method from system_tests_compare_two with one that calls run_indv with the st_archive flag set for the first run and then copies (or links) the files from the archive restart directory into the case2 run directory before starting run2. This is the only local method that this class should need, all the rest can be inherited.

@billsacks
Copy link
Member Author

I wouldn't override run_phase: the design is that specific tests should fill in hooks provided by this method, but should not override this method: If you override this method, you'll end up needing to copy & paste into erp.py, which will be a maintenance problem, and could lead to unintentional divergence.

But I see @jedwards4b 's overall point. I would just instead do this by:

  1. Adding new arguments to the SystemTestsCompareTwo constructor saying whether st_archive should be done for case1 and case2. i.e., this would look like:

      def __init__(self,
                   case,
                   separate_builds,
                   run_one_st_archive = False,
                   run_two_st_archive = False,
                   run_two_suffix = 'test',
                   run_one_description = '',
                   run_two_description = ''):

    and then in the run_phase method:

          self.run_indv(suffix = self._run_one_suffix, st_archive = run_one_st_archive)

    and similarly for case2.

  2. Add a _pre_run2 hook. By default (i.e., in SystemTestsCompareTwo) this would just be implemented as 'pass' (similar to _common_setup). The erp test could override this to do the necessary work.

@mfdeakin-sandia
Copy link
Contributor

I've set it up to run the short term archiver (the same way Bill prefered and I alluded to) for case one, but I'm unfortunately not even getting to that point in run_indv. Instead, the runs are failing in _component_compare_copy, as it doesn't seem to be finding any history files. What part of the run can I look at to see why they aren't being created?

@jgfouca
Copy link
Contributor

jgfouca commented Nov 17, 2016

hist_utils has routines that find history files, look at what's in the directory and find the history files then see why the filenames are not being recognized.

@mfdeakin-sandia
Copy link
Contributor

I've been looking at hist_utils, and it's looking for files with the regex suffix ".h?.nc". There are no such files in the run directory. The only netcdf files created are "run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_110448_8lees5.cpl.r.0001-01-04-00000.nc" and "run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_110448_8lees5.dice.r.0001-01-04-00000.nc"

@jgfouca
Copy link
Contributor

jgfouca commented Nov 17, 2016

OK, then the test is misconfigured and it's not running long enough to make a hist file.

@jgfouca
Copy link
Contributor

jgfouca commented Nov 17, 2016

You'll need to look at settings like HIST_OPTION, HIST_N, STOP_OPTION, STOP_N

@mfdeakin-sandia
Copy link
Contributor

That does seem to be the case. I thought I had specified all of the configuration options, but it looks like some were hidden for some reason.

@jgfouca
Copy link
Contributor

jgfouca commented Nov 17, 2016

Look at the config_tests.xml file.

@mfdeakin-sandia
Copy link
Contributor

mfdeakin-sandia commented Nov 17, 2016

I was :), but the only difference between it and master is that I added <.. SMP_BUILD >0</...>. I also matched (or attempted to) the settings done in the run and build phases of the old ERP test. So why else would this change?

@jgfouca
Copy link
Contributor

jgfouca commented Nov 17, 2016

Not sure, it's possible the test has been broken all along. you may want to confirm that

@billsacks
Copy link
Member Author

The ERP test is missing this in config_tests.xml:

    <HIST_OPTION>$STOP_OPTION</HIST_OPTION>
    <HIST_N>$STOP_N</HIST_N>

Take a look at some other tests (NCK, PEM, etc.) for an example.

@billsacks
Copy link
Member Author

It looks like these things had been set in erp.py:

                self._case.set_value("REST_N", rest_n)
                self._case.set_value("REST_OPTION", stop_option)
                self._case.set_value("HIST_N", stop_n)
                self._case.set_value("HIST_OPTION", stop_option)
                self._case.set_value("CONTINUE_RUN", False)

REST_OPTION & REST_N are specific to the erp test, so should be kept in the python. HIST_N, HIST_OPTION and CONTINUE_RUN should be set in config_tests.xml, following what we're moving to for other tests.

@mfdeakin-sandia
Copy link
Contributor

Ah, I did miss them. That explains my confusion.

@mfdeakin-sandia
Copy link
Contributor

Alright, I have the short term archive files from the archive directory. Do I need to copy all of the netcdf files in the tree, or just the ones in the hist directories? And where in the second runs case directory do I put them?

@jedwards4b
Copy link
Contributor

Under the rest directory there will be a directory with a date format 0001-01-03-00000 or something like that - you want all the files from that directory to go into the second case rundir.

@mfdeakin-sandia
Copy link
Contributor

I think I've gotten everything working. Just to verify, these are the files copied and where I'm copying them to:

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.rof
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/rpointer.rof

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.ice
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/rpointer.ice

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.dice.r.0001-01-04-00000.nc
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.dice.r.0001-01-04-00000.nc

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.ocn
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/rpointer.ocn

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.cpl.r.0001-01-04-00000.nc
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.cpl.r.0001-01-04-00000.nc

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.docn.rs1.0001-01-04-00000.bin
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.docn.rs1.0001-01-04-00000.bin

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.atm
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/rpointer.atm

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.datm.rs1.0001-01-04-00000.bin
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.datm.rs1.0001-01-04-00000.bin

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.drv
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/rpointer.drv

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.dice.rs1.0001-01-04-00000.bin
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.dice.rs1.0001-01-04-00000.bin

/home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.drof.rs1.0001-01-04-00000.bin
/home/mdeakin/acme/scratch/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.test/run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.drof.rs1.0001-01-04-00000.bin

@jedwards4b
Copy link
Contributor

jedwards4b commented Nov 17, 2016

Keep in mind that the exact set of files depend on the compset and the exact name of the rest directory depends on the time-date stamp. Having said that you should only expect a single subdirectory under rest and if you copy all of the files from that subdirectory to the run directory of the second case you should have everything you need.

@jedwards4b
Copy link
Contributor

PR #824 determined that this was an ill-concieved issue and we should not refactor this test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants