-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert ERP test to SystemTestsCompareTwo infrastructure? #448
Comments
What are the relevant restart files, or how can I find out their names? ERP.run_phase copies some XML files to the LockedFiles directory and what looks like the CESM/ACME executable, but doesn't copy anything which looks like a restart file. It does call case_setup again, but that doesn't seem related either |
I suggest that we run the st_archive on case1 and retrieve them from the archive directory for case 2. The st_archiver already knows what files are required. |
Alright, I'll look into how I can do that. I assume it's not going to be as simple as the hacky run_cmd_no_fail("case.st_archive")? |
not as hacky as that just need to set st_archive argument in run_indv(self, suffix="base", st_archive=False) to True. You would do this for the first run and then retreve all of the files from the archive restart directory into case2 run directory. |
Hmm. I was hoping to be able to do this all in the _pre_run_hook method @billsacks and @mvertens mentioned. I guess adding more default constructor parameters for SystemTestsCompareTwo will work for this, though. |
system_tests_compare_two.py isn't calling run_indv with all the available arguments, i think that you just need to add that st_archive argument there to make it available to tests using that method. |
I'm a bit hesitant about using the short-term archiver here: The ERP test already does so much that it can be challenging to figure out what went wrong when it fails. Adding the short-term archiver into the mix adds an additional potential failure mode. Or multiple new failure modes if this is done via an additional job... I can't tell if that's what you're suggesting here, but I would at least vote against having this be done via multiple batch jobs if possible. On the other hand, I see @jedwards4b 's point that that might be the easiest way to be sure you're picking up all the necessary restart files. The one other idea that comes to mind is refactoring the short-term archiver to extract a clean function that gives you the list of all restart files. It looks like this function would contain some of what's currently in case_st_archive.py: _archive_restarts. Then this function could be called by the short-term archiver. In addition, it could be called directly by the ERP test to get the list of restart files that should be copied (or linked) into the second case. All that said: I guess I'm okay with sticking with the idea of using the short-term archiver in the ERP test, as long as it doesn't require submitting a separate job. |
I would override the run_phase method from system_tests_compare_two with one that calls run_indv with the st_archive flag set for the first run and then copies (or links) the files from the archive restart directory into the case2 run directory before starting run2. This is the only local method that this class should need, all the rest can be inherited. |
I wouldn't override run_phase: the design is that specific tests should fill in hooks provided by this method, but should not override this method: If you override this method, you'll end up needing to copy & paste into erp.py, which will be a maintenance problem, and could lead to unintentional divergence. But I see @jedwards4b 's overall point. I would just instead do this by:
|
I've set it up to run the short term archiver (the same way Bill prefered and I alluded to) for case one, but I'm unfortunately not even getting to that point in run_indv. Instead, the runs are failing in _component_compare_copy, as it doesn't seem to be finding any history files. What part of the run can I look at to see why they aren't being created? |
hist_utils has routines that find history files, look at what's in the directory and find the history files then see why the filenames are not being recognized. |
I've been looking at hist_utils, and it's looking for files with the regex suffix ".h?.nc". There are no such files in the run directory. The only netcdf files created are "run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_110448_8lees5.cpl.r.0001-01-04-00000.nc" and "run/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_110448_8lees5.dice.r.0001-01-04-00000.nc" |
OK, then the test is misconfigured and it's not running long enough to make a hist file. |
You'll need to look at settings like HIST_OPTION, HIST_N, STOP_OPTION, STOP_N |
That does seem to be the case. I thought I had specified all of the configuration options, but it looks like some were hidden for some reason. |
Look at the config_tests.xml file. |
I was :), but the only difference between it and master is that I added <.. SMP_BUILD >0</...>. I also matched (or attempted to) the settings done in the run and build phases of the old ERP test. So why else would this change? |
Not sure, it's possible the test has been broken all along. you may want to confirm that |
The ERP test is missing this in config_tests.xml: <HIST_OPTION>$STOP_OPTION</HIST_OPTION>
<HIST_N>$STOP_N</HIST_N> Take a look at some other tests (NCK, PEM, etc.) for an example. |
It looks like these things had been set in erp.py: self._case.set_value("REST_N", rest_n)
self._case.set_value("REST_OPTION", stop_option)
self._case.set_value("HIST_N", stop_n)
self._case.set_value("HIST_OPTION", stop_option)
self._case.set_value("CONTINUE_RUN", False) REST_OPTION & REST_N are specific to the erp test, so should be kept in the python. HIST_N, HIST_OPTION and CONTINUE_RUN should be set in config_tests.xml, following what we're moving to for other tests. |
Ah, I did miss them. That explains my confusion. |
Alright, I have the short term archive files from the archive directory. Do I need to copy all of the netcdf files in the tree, or just the ones in the hist directories? And where in the second runs case directory do I put them? |
Under the rest directory there will be a directory with a date format 0001-01-03-00000 or something like that - you want all the files from that directory to go into the second case rundir. |
I think I've gotten everything working. Just to verify, these are the files copied and where I'm copying them to: /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.rof /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.ice /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.dice.r.0001-01-04-00000.nc /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.ocn /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.cpl.r.0001-01-04-00000.nc /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.docn.rs1.0001-01-04-00000.bin /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.atm /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.datm.rs1.0001-01-04-00000.bin /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/rpointer.drv /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.dice.rs1.0001-01-04-00000.bin /home/mdeakin/acme/scratch/archive/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz/rest/0001-01-04-00000/ERP_Ld3.ne16_ne16.A.sandia-srn-sems_gnu.20161117_145035_vb78lz.drof.rs1.0001-01-04-00000.bin |
Keep in mind that the exact set of files depend on the compset and the exact name of the rest directory depends on the time-date stamp. Having said that you should only expect a single subdirectory under rest and if you copy all of the files from that subdirectory to the run directory of the second case you should have everything you need. |
PR #824 determined that this was an ill-concieved issue and we should not refactor this test. |
We want to convert the various system tests that compare two runs to use the SystemTestsCompareTwo infrastructure, where possible. I am opening separate issues for each such system test, so that we can check them off as we do this conversion.
The ERP test would be less straightforward to convert to this infrastructure, since currently it makes use of setting CONTINUE_RUN which doesn't work if we're using a clone-based test. On the other hand, the criticality and complexity of the ERP test argue for translating it to this new infrastructure, which will make it more maintainable and less error-prone.
@mvertens suggests changing its behavior so that, rather than doing a CONTINUE_RUN, it instead uses a branch run for the second case. For this to work, we'd need to introduce a new hook into the SystemTestsCompareTwo infrastructure: a _pre_run_two hook that is executed between the case1 run and the case2 run. For the ERP test, this would need to copy (or link) the relevant restart files from the case1 run directory to the case2 run directory.
The text was updated successfully, but these errors were encountered: