-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve carryover dependency handling #623
Conversation
58af44a
to
4b9660f
Compare
Test output:
That gives the following dependencies: e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.settings:
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.settings:
global_time_series_1985-1995.settings:
ilamb_1985-1988.settings:
mpas_analysis_ts_1985-1989_climo_1985-1989.settings:
mpas_analysis_ts_1985-1995_climo_1990-1995.settings:
tc_analysis_1985-1986.settings:
tc_analysis_1987-1988.settings:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chengzhuzhang I've made some comments to better explain this PR. Thanks.
zppy/e3sm_diags.py
Outdated
@@ -33,10 +33,12 @@ def e3sm_diags(config, scriptDir, existing_bundles, job_ids_file): # noqa: C901 | |||
return existing_bundles | |||
|
|||
# --- Generate and submit e3sm_diags scripts --- | |||
dependencies: List[str] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is ultimately why you were running into your second problem on #622. The dependencies list is defined before cycling through the tasks. That is, dependencies accumulate rather than start fresh with each task.
That was actually somewhat intentional, because some tasks (notably MPAS-Analysis) relies on previous runs of itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@forsyth2 Thank for looking into this. To start fresh dependencies list for each e3sm_diags
sub tasks, we can just reset dependencies = []
, when looping over tasks, i.e. under line 38 in e3sm_diags.py without touching other part of zppy. I'm not sure why it is needed to introduce another parameter called carried_over_dependencies
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chengzhuzhang we can't start a completely fresh dependencies list on each new task because there is a case where we do want to "remember" dependencies from the previous one. This happens when tc_analysis
is a dependency.
In https://github.com/E3SM-Project/zppy/blob/main/zppy/e3sm_diags.py:
# Due to a `socket.gaierror: [Errno -2] Name or service not known` error when running e3sm_diags with tc_analysis
# on multiple year_sets, if tc_analysis is in sets, then e3sm_diags should be run sequentially.
if "tc_analysis" in c["sets"]:
# Note that this line should still be executed even if jobid == -1
# The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(statusFile)
A socket error occurs if we try to run e3sm_diags
in parallel on tc_analysis
. To work around that, we simply make e3sm_diags
run sequentially when it's working with tc_analysis
. (dependencies.append(statusFile)
adds the current task to the dependencies for the next task -- this is the point of carried_over_dependencies
). This is similar to how we make mpas_analysis
run sequentially.
The unfortunate side effect is that e3sm_diags
tasks that do not deal with tc_analysis
also end up caught in the sequential dependency chaining. And that means they don't run if an earlier task fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious how tc_analysis
dependency works differently than other dependencies such as climo and time-series. I'm interested in reproducing the socket error
. To try to run e3sm_diags in parallel on tc_analysis
, does it mean, when tc_analysis task is on, while tc_analysis
set is not set to be one of e3sm_diags
sets to run, so that tc_analysis
and e3sm-diags
run in parallel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious how
tc_analysis
dependency works differently than other dependencies such as climo and time-series.
In this pull request, it's this line in the task we're finishing up
carried_over_dependencies.append(statusFile)
and this line in the next task:
dependencies: List[str] = carried_over_dependencies
This resets all other dependencies, except for the one we specifically added to carried_over_dependencies
(i.e., the last task).
I'm interested in reproducing the
socket error
This error occurs intermittently when running two e3sm_diags
jobs, both of which run the tc_analysis
set. The sequential workaround was introduced in #169. A full discussion of the tc_analysis
concurrency/parallelism bugs can be found in #180.
zppy/e3sm_diags.py
Outdated
@@ -267,7 +271,7 @@ def e3sm_diags(config, scriptDir, existing_bundles, job_ids_file): # noqa: C901 | |||
# Note that this line should still be executed even if jobid == -1 | |||
# The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail). | |||
# Add to the dependency list | |||
dependencies.append(statusFile) | |||
carried_over_dependencies.append(statusFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, instead of passing all dependencies on to the next task, only specific tasks (i.e., the current task) will be propagated forward as dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the bug:
While carried_over_dependencies
works for mpas_analysis
and tc_analysis
, there still seems to be some trouble for e3sm_diags
.
From an earlier comment on this PR:
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.settings:
- climo_atm_monthly_180x360_aave_1987-1988.status [BUG: carried over]
- tc_analysis_1987-1988.status [BUG: carried over]
- e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.status [BUG; it looks like this is marked as a carryover dependency because e3sm_diags.py is expecting another model_vs_obs after that uses tc_analysis, e.g., 1989-1990, but the mvm run here instead gets stuck with the dependency.]
- climo_atm_monthly_180x360_aave_1987-1988.status
I'm not entirely sure why the first two are showing up as dependencies when they're not carried over. I need to investigate the logic to see if zppy is picking them up as dependencies for e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.
As for e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988 itself, as mentioned in the quote block above, the issue is that the presence of tc_analysis
is telling zppy that we should run diags
sequentially. But we don't actually need to do that because the next task isn't doing any work with tc_analysis
. Unfortunately, there's no way to know a priori if the next task will be doing that or not. (We'd need to do a preemptive search of later tasks).
zppy/mpas_analysis.py
Outdated
@@ -133,7 +139,7 @@ def mpas_analysis(config, scriptDir, existing_bundles, job_ids_file): | |||
# Note that this line should still be executed even if jobid == -1 | |||
# The later MPAS-Analysis tasks still depend on this task (and thus will also fail). | |||
# Add to the dependency list | |||
dependencies.append(statusFile) | |||
carried_over_dependencies.append(statusFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The carried_over_dependencies
switch works for MPAS-Analysis.
zppy/tc_analysis.py
Outdated
@@ -88,7 +93,7 @@ def tc_analysis(config, scriptDir, existing_bundles, job_ids_file): | |||
# Note that this line should still be executed even if jobid == -1 | |||
# The later tc_analysis tasks still depend on this task (and thus will also fail). | |||
# Add to the dependency list | |||
dependencies.append(statusFile) | |||
carried_over_dependencies.append(statusFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The carried_over_dependencies
switch works for tc_analysis
.
7851767
to
c90c4a8
Compare
zppy/e3sm_diags.py
Outdated
@@ -99,7 +103,7 @@ def e3sm_diags(config: ConfigObj, script_dir: str, existing_bundles, job_ids_fil | |||
# Note that this line should still be executed even if jobid == -1 | |||
# The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail). | |||
# Add to the dependency list | |||
dependencies.append(status_file) | |||
carried_over_dependencies.append(status_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rather than carried_over_dependencies
we have a variable previous_status_file
and then if "tc_analysis" in c["sets"]
we add that to the dependencies as well. Then, diags not using tc_analysis wouldn't be affected.
Rebased after the modularizing refactor of #628. Merge conflicts affected:
and have now been resolved. It looks like the now 22 unit tests are passing. I had to commit with
which I assume is related to #627. |
Yep, fresh-install worked. Thanks @xylar |
c5da603
to
84f8e2c
Compare
for c in tasks: | ||
|
||
dependencies: List[str] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tasks that weren't changed in this PR already have dependencies
reset inside the for c in tasks
loop
for c in tasks: | ||
|
||
dependencies: List[str] = carried_over_dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mpas_analysis
doesn't have any other dependencies besides previous runs of itself, so we could just move this definition out of the for c in tasks
loop, but I think it's more comprehensible to explicitly note the dependency on previous runs is being carried over, via the name carried_over_dependencies
The unit tests pass and I ran |
For reference, the dependency output from that min-case test:
We can then find the following:
|
84f8e2c
to
5d388cf
Compare
Issue resolution
tc_analysis
in zppy #622. However, Make TC Analysis parallel #631 ended up closing that by removing the carryover dependency requirement ine3sm_diags
entirely. That said, [Bug]: Issues withtc_analysis
in zppy #622 did reveal discrepancies in how dependencies were reset (or not) amongst the different tasks. This PR creates consistent dependency handling amongst the tasks.Select one: This pull request is...
1. Does this do what we want it to do?
Objectives:
.settings
file to ease debugging (i.e., it will be obvious to tell what zppy thought the dependencies were)Required:
tc_analysis
in zppy #622.2. Are the implementation details accurate & efficient?
Required:
3. Is this well documented?
Required:
4. Is this code clean?
Required: