Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making reload safer #5107

Closed
dpmatthews opened this issue Aug 26, 2022 · 10 comments · Fixed by #5592
Closed

Making reload safer #5107

dpmatthews opened this issue Aug 26, 2022 · 10 comments · Fixed by #5592
Assignees
Labels
question Flag this as a question for the next Cylc project meeting.
Milestone

Comments

@dpmatthews
Copy link
Contributor

re https://cylc.discourse.group/t/task-wrangling-issues/517/8
We need to make it obvious when reloads fail.

I think reload should:

  • pause the workflow (and possibly wait for any tasks in the preparing state to submit?)
  • reload the config (log what has changed?)
  • if the reload succeeds, release the workflow (unless the workflow was previously held)
  • if the reload fails, leave the workflow paused so the user has the choice to resume with the existing workflow definition or fix the problem and reload again

Also, what about restarts? If a server dies and you restart a workflow it's meant to continue unchanged. However, what happens if you've done a reinstall but not reloaded (and perhaps even have a workflow config which does not validate)?
I think restart should always use the previous config.
If this differs from the installed config this should be logged (if possible) but not loaded.
We could add a --reload option to cylc play to support the old behaviour.

@dpmatthews dpmatthews added the question Flag this as a question for the next Cylc project meeting. label Aug 26, 2022
@dpmatthews dpmatthews added this to the cylc-8.x milestone Aug 26, 2022
@oliver-sanders
Copy link
Member

and possibly wait for any tasks in the preparing state to submit?

We would need to flush through all tasks in the preparing state (otherwise we can't say for sure whether the job file was written with the old or new settings) whilst the workflow is paused (otherwise tasks will continue to enter the preparing state).

One downside of this is that a remote operation (e.g. remote-init) running during the reload process would cause the workflow to become unresponsive for that period. Without major changes there would be no way to "cancel" a reload once actioned.

@dpmatthews
Copy link
Contributor Author

I think restart should always use the previous config.

It should also not repeat any remote file installation.

@oliver-sanders
Copy link
Member

With the current approach to reload it's not especially clear what is and isn't supposed to change as a result of a reload.

I went through the code and pulled out the attributes which are and aren't reloaded:

Task Proxy Attributes

Reloaded (not explicitly preserved):

  • clock_trigger_time
  • expire_time
  • graph_children
  • is_late
  • late_time
  • non_unique_events
  • reload_successor
  • tdef
  • waiting_on_job_prep

Preserved:

  • flow_wait
  • identity (implicitly preserved as TaskDef cannot change name, point is preserved)
  • is_manual_submit
  • job_vacated
  • jobs
  • local_job_file_path
  • platform (Note: task_events_mgr looks at the task def not the .platform attribute for this)
  • poll_timer
  • state
  • submit_num
  • summary
  • timeout
  • tokens (implicitly preserved as identity is preserved)
  • try_timers
  • flow_nums (set in Task Pool)
  • point (set in Task Pool)
  • point_as_seconds (set in Task Pool)

Task State Attributes

Reloaded (not explicitly preserved):

  • _is_satisfied
  • _suicide_is_satisfied
  • external_triggers
  • is_queued
  • kill_failed
  • suicide_prerequisites
  • time_updated
  • xtriggers (_cylc xtriggers are preserved, others are permitted to change)

Preserved:

  • is_held
  • is_runahead
  • is_updated
  • outputs
  • prerequisites
  • status (set in Task Pool)

Main takeaway is that prereqs and outputs can't be changed by a reload. Since SoD this should probably be possible now. It makes sense to add or remove a prerequisite of a waiting task or add an output to a failed one?

@hjoliver
Copy link
Member

Well done, good to have that info laid out in full 👍

Main takeaway is that prereqs and outputs can't be changed by a reload.

Worth pointing out, for anyone reading this, we're only talking about reloading existing task proxies here - everything can be changed by reload if the task hasn't been spawned yet.

Since SoD this should probably be possible now. It makes sense to add or remove a prerequisite of a waiting task or add an output to a failed one?

Yeah, we should be able to do that.

@hjoliver
Copy link
Member

What should be preserved:

  • any attribute that relates to the history of this particular task proxy (jobs, submit number etc.)

What should be reloaded:

  • everything else

Any exceptions to that?

@hjoliver
Copy link
Member

Just realized I didn't comment on @dpmatthews suggestions at the top.

I like the reload suggestions 👍 And I particularly like the restart idea: always restart with the pre-restart config, so the user always has to reload an updated flow.cylc. (However, this will make it even more important to get reload right, as restart won't be a fall back any more.)

@oliver-sanders
Copy link
Member

For the record here's a portion of the diff I was working on
         for itask in tasks:
             if itask.tdef.name in self.orphans:
                 if (
-                        itask.state(TASK_STATUS_WAITING)
-                        or itask.state.is_held
-                        or itask.state.is_queued
+                    itask.state(TASK_STATUS_WAITING)
+                    or itask.state.is_held
+                    or itask.state.is_queued
                 ):
                     # Remove orphaned task if it hasn't started running yet.
                     self.remove(itask, 'task definition removed')
@@ -881,21 +881,31 @@ class TaskPool:
                         "- task definition removed"
                     )
             else:
-                new_task = TaskProxy(
-                    self.config.get_taskdef(itask.tdef.name),
-                    itask.point, itask.flow_nums, itask.state.status)
-                itask.copy_to_reload_successor(new_task)
-                self._swap_out(new_task)
-                LOG.info(f"[{itask}] reloaded task definition")
-                if itask.state(*TASK_STATUSES_ACTIVE):
+                if (
+                    itask.state(TASK_STATUS_PREPARING, *TASK_STATUSES_ACTIVE)
+                    or itask.waiting_on_job_prep  # pre preparation
+                ):
                     LOG.warning(
                         f"[{itask}] active with pre-reload settings"
                     )
-                elif itask.state(TASK_STATUS_PREPARING):
-                    # Job file might have been written at this point?
-                    LOG.warning(
-                        f"[{itask}] may be active with pre-reload settings"
-                    )
+                else:
+                    itask.reload(self.config.get_taskdef(itask.tdef.name))
+                    LOG.info(f"[{itask}] reloaded task definition")
+                # new_task = TaskProxy(
+                #     self.config.get_taskdef(itask.tdef.name),
+                #     itask.point, itask.flow_nums, itask.state.status)
+                # itask.copy_to_reload_successor(new_task)
+                # self._swap_out(new_task)
+                # LOG.info(f"[{itask}] reloaded task definition")
+                # if itask.state(*TASK_STATUSES_ACTIVE):
+                #     LOG.warning(
+                #         f"[{itask}] active with pre-reload settings"
+                #     )
+                # elif itask.state(TASK_STATUS_PREPARING):
+                #     # Job file might have been written at this point?
+                #     LOG.warning(
+                #         f"[{itask}] may be active with pre-reload settings"
+                #     )

diff --git a/cylc/flow/task_proxy.py b/cylc/flow/task_proxy.py
index b5511ffb1..7216993a4 100644
--- a/cylc/flow/task_proxy.py
+++ b/cylc/flow/task_proxy.py
@@ -266,6 +266,20 @@ class TaskProxy:
             f" flows:{','.join(str(i) for i in self.flow_nums) or 'none'}"
         )
 
+    def reload(self, tdef):
+        # swap out the task definition
+        self.tdef = tdef
+
+        # reload the task state (incl. triggers & outputs)
+        # self.state.reload(tdef, self.point)
+
+        # recompute derived values
+        self.get_late_time()
+        self.graph_children = generate_graph_children(tdef, self.point)
+
+        self.state.is_queued = False
+        self.state.is_updated = True
         for itask in tasks:
             if itask.tdef.name in self.orphans:
                 if (
-                        itask.state(TASK_STATUS_WAITING)
-                        or itask.state.is_held
-                        or itask.state.is_queued
+                    itask.state(TASK_STATUS_WAITING)
+                    or itask.state.is_held
+                    or itask.state.is_queued
                 ):
                     # Remove orphaned task if it hasn't started running yet.
                     self.remove(itask, 'task definition removed')
@@ -881,21 +881,31 @@ class TaskPool:
                         "- task definition removed"
                     )
             else:
-                new_task = TaskProxy(
-                    self.config.get_taskdef(itask.tdef.name),
-                    itask.point, itask.flow_nums, itask.state.status)
-                itask.copy_to_reload_successor(new_task)
-                self._swap_out(new_task)
-                LOG.info(f"[{itask}] reloaded task definition")
-                if itask.state(*TASK_STATUSES_ACTIVE):
+                if (
+                    itask.state(TASK_STATUS_PREPARING, *TASK_STATUSES_ACTIVE)
+                    or itask.waiting_on_job_prep  # pre preparation
+                ):
                     LOG.warning(
                         f"[{itask}] active with pre-reload settings"
                     )
-                elif itask.state(TASK_STATUS_PREPARING):
-                    # Job file might have been written at this point?
-                    LOG.warning(
-                        f"[{itask}] may be active with pre-reload settings"
-                    )
+                else:
+                    itask.reload(self.config.get_taskdef(itask.tdef.name))
+                    LOG.info(f"[{itask}] reloaded task definition")
+                # new_task = TaskProxy(
+                #     self.config.get_taskdef(itask.tdef.name),
+                #     itask.point, itask.flow_nums, itask.state.status)
+                # itask.copy_to_reload_successor(new_task)
+                # self._swap_out(new_task)
+                # LOG.info(f"[{itask}] reloaded task definition")
+                # if itask.state(*TASK_STATUSES_ACTIVE):
+                #     LOG.warning(
+                #         f"[{itask}] active with pre-reload settings"
+                #     )
+                # elif itask.state(TASK_STATUS_PREPARING):
+                #     # Job file might have been written at this point?
+                #     LOG.warning(
+                #         f"[{itask}] may be active with pre-reload settings"
+                #     )


diff --git a/cylc/flow/task_proxy.py b/cylc/flow/task_proxy.py
index b5511ffb1..7216993a4 100644
--- a/cylc/flow/task_proxy.py
+++ b/cylc/flow/task_proxy.py
@@ -266,6 +266,20 @@ class TaskProxy:
             f" flows:{','.join(str(i) for i in self.flow_nums) or 'none'}"
         )
 
+    def reload(self, tdef):
+        # swap out the task definition
+        self.tdef = tdef
+
+        # recompute derived values
+        self.get_late_time()
+        self.graph_children = generate_graph_children(tdef, self.point)
+
+        self.state.is_queued = False
+        self.state.is_updated = True

Fairly straight forward, this change should simplify things a fair bit. The remaining difficulty is the reloading of outputs / prerequisites which turned out to be too complicated to hash out quickly for 8.0.x (too complicated and too risky too).

@oliver-sanders
Copy link
Member

See also #5579, we should probably wait for preparing tasks to clear before pausing the workflow.

@oliver-sanders oliver-sanders self-assigned this Jun 19, 2023
@oliver-sanders oliver-sanders modified the milestones: cylc-8.x, cylc-8.2.0 Jun 19, 2023
@oliver-sanders
Copy link
Member

I'm taking a look into flushing out preparing tasks and pausing the workflow pre-reload as a means to fixing #5579. If successful I will punt the remainder of this issue (to do with the interaction of restart and reload) to another issue for later consideration.

oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 19, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
@oliver-sanders
Copy link
Member

Bumped the cylc play --reload stuff to #5591

oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 19, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 19, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 20, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 27, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 27, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jun 27, 2023
* Closes cylc#5107
* Reload now waits for pending tasks to submit before attempting to
  reload the config itself.
* Reload now also puts the workflow into the paused state during the
  reload process. This doesn't actually achieve anything as the
  reload command is blocking in the main loop, but it does help to
  communicate that the workflow will not de-queue or submit and new
  tasks during this process.
* The workflow status message is now updated to reflect the reload
  progress.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Flag this as a question for the next Cylc project meeting.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants