
Kill proc pool commands that take too long. #2659

Merged
merged 6 commits into from
May 16, 2018

Conversation

hjoliver
Member

Hung or long-running event handlers (and other external commands) can cause a suite to stall with tasks in the 'ready' state (i.e. with job submission commands queued but unable to execute because the maximum number of subprocesses is already in use).
(I have some evidence that this has happened at BoM: a suite stalled with 'ready' tasks after reportedly spawning job polling commands that hung for some reason).

This kills subprocesses that are still running after 10 seconds. It was easier to do than I expected. @cylc/core - thoughts? (Should we do this; should it be configurable; is this approach OK?)
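The mechanism described above can be sketched as follows. This is an illustrative stand-alone example, not Cylc's actual code: the names, the 1-second timeout (shortened from the PR's 10 seconds for brevity), and the polling interval are all assumptions.

```python
import os
import signal
import subprocess
import time

TIMEOUT = 1.0  # illustrative; the PR initially hard-coded 10 seconds

# Start the command in its own process group (preexec_fn=os.setsid) so
# that a later killpg() also takes out any children it spawned.
proc = subprocess.Popen(['sleep', '60'], preexec_fn=os.setsid)
start = time.time()

# Poll until the command exits or the timeout expires.
while proc.poll() is None:
    if time.time() - start > TIMEOUT:
        os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
        break
    time.sleep(0.05)

ret_code = proc.wait()  # negative: terminated by a signal
```

Because `os.setsid` makes the child its own process-group leader, `proc.pid` doubles as the process-group ID passed to `killpg`.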

@hjoliver hjoliver self-assigned this May 10, 2018
@matthewrmshin matthewrmshin added this to the soon milestone May 10, 2018
@matthewrmshin
Contributor

#2315 is the master issue. I don't think we can have a blanket 10s timeout, because some external commands may take a while to get through (e.g. job log retrieval, a badly designed event handler, a slow network, submitting jobs to a busy batch system, etc.).

The most correct approach is to expose the issue to users, and allow them to tell the suite to kill these commands via the client-server API, the CLI and/or the GUI.

@ColemanTom
Contributor

I'll throw out a couple of other ideas.

  1. You could have an option in nodes that people could set. By default, it does nothing, but people could set a timeout if they want to.
  2. Perhaps cylc could have a limit on the number of active processes, beyond which it refuses to spawn new tasks, polling periodically to see whether it can spawn more. This could be definable in the global settings as a percentage of the maximum (defaulting to, say, no more than 90% of the maximum processes running). Before spawning a task, cylc would check how close it is to the limit and hold off if it is there. By default this would be off, but the option would exist for the system admin who sets cylc up.

@hjoliver
Member Author

hjoliver commented May 11, 2018

The most correct approach is to expose the issue to users, and allow them to tell the suite to kill these commands via the client-server API, the CLI and/or the GUI.

Yes, it would be good for users to have a "window" (via the CLI and/or GUI) into which external command subprocesses are currently running, showing the command types (job submit, event handler, etc.) and command lines, along with how long they have been running, plus an option to kill them. We should put up an issue for this (or add it to #2315).

However,

  1. that may be a non-trivial job and we do have an immediate problem: if the process pool locks up with hung processes, the only solution right now is to kill the suite server program and do a restart.
  2. I don't agree that this is the most correct approach for hung processes, which evidently do happen (and it's a distinct possibility with user-supplied event handlers). If that occurs in the dead of night, it would be best to eliminate the problem automatically, presumably by killing the process on a timeout. Otherwise, the suite stalls until morning.

I suggest (as per Tom's suggestion 1 above) we allow my kill on timeout as a global config option, and just document that if you use a timeout it had better be long enough to avoid killing important processes when the system loads up a bit. Does that seem reasonable?

@hjoliver
Member Author

@ColemanTom - I don't quite understand your second point, but you may be describing more or less what Cylc does already: it manages the number of spawned subprocesses to be <= N, where N is configurable but defaults to the number of cores on the suite host. Hence the problem here, if all N slots are held by hung processes.
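The pool behaviour described here (at most N concurrent subprocesses, N defaulting to the host's core count) can be sketched in a few lines. This is a simplified illustration under assumed names, not Cylc's implementation:

```python
import multiprocessing
import subprocess

# Assumed default, per the comment above: pool size = number of cores.
MAX_PROCS = multiprocessing.cpu_count()

queued = [['true'] for _ in range(MAX_PROCS * 3)]  # commands awaiting a slot
running = []


def process():
    """One scheduling pass: reap finished procs, then top up the pool."""
    global running
    # Drop finished processes to free their slots.
    running = [p for p in running if p.poll() is None]
    # Fill free slots from the queue, never exceeding the limit.
    while queued and len(running) < MAX_PROCS:
        running.append(subprocess.Popen(queued.pop(0)))
```

If every slot is held by a hung process, `process()` can never start anything new, which is exactly the stall this PR addresses.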

@ColemanTom
Contributor

Yes, I wasn't sure how to word 2, but that is what I meant.

@matthewrmshin
Contributor

Hi @hjoliver I can now see where you are coming from. In which case, I have no objection to a default time out for all managed subprocesses by the suite server program. This should fix the original intention of #2315.

Quick brain dump:

  • However, the default value should be somewhat higher than PT10S to avoid surprises. I think we should pick a value around 5-15 minutes? (That should handle the worst rsync job log retrieval commands.)
  • I think we should still have an API/CLI/GUI functionality to list and/or send signal to managed subprocesses.
  • Finally, we are still lacking an API functionality to tell the server program to relaunch selected task event handlers.

@hjoliver
Member Author

hjoliver commented May 14, 2018

@matthewrmshin - made configurable and upped the timeout to PT10M, as suggested. Tests added. Could you post another issue for your ideas on an API to give users more control over the process pool? (I think that goes well beyond this PR, so long as the kill timeout is sufficiently long.)

@hjoliver
Member Author

(Just to note, the motivation for this PR was to replace - and generalize to all pool processes - the function timeout mechanism that I had previously added to #2423, which got nuked by #2590.)

@matthewrmshin
Contributor

Tests OK in my environment.

@matthewrmshin matthewrmshin left a comment

Some quick comments.

@@ -159,7 +163,14 @@ def process(self):
        for proc, ctx, callback, callback_args in self.runnings:
            ret_code = proc.poll()
            if ret_code is None:
                runnings.append([proc, ctx, callback, callback_args])
                if time.time() > ctx.init_time + self.cmd_timeout:
                    os.killpg(proc.pid, SIGKILL)
Contributor

Should protect this with a try ... except OSError block, just in case the process exits normally between line 165 and this line.
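The guarded kill the review asks for might look like the sketch below: the helper name is hypothetical, but the race it tolerates is real (the process can exit and be reaped between `poll()` and `killpg()`, in which case `killpg` raises `OSError` with `ESRCH`).

```python
import errno
import os
import signal
import subprocess


def kill_proc_group(proc):
    """Kill proc's process group, tolerating a vanished process.

    If the process exited and was reaped between our poll() and this
    call, os.killpg raises OSError with errno ESRCH; swallow only that.
    """
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except OSError as exc:
        if exc.errno != errno.ESRCH:
            raise
```

In Python 3 the `ESRCH` case surfaces as `ProcessLookupError`, a subclass of `OSError`, so the same guard covers both.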

                    ctx.ret_code = 1
                    ctx.err = "killed on timeout (%s)" % self.cmd_timeout
                    self._run_command_exit(ctx, callback, callback_args)
                    proc.wait()
Contributor

Better this way?

os.killpg(proc.pid, SIGKILL)
ctx.out, ctx.err = proc.communicate()
ctx.ret_code = proc.wait()
ctx.err += '\nkilled on timeout (%s)' % self.cmd_timeout
self._run_command_exit(ctx, callback, callback_args)

ctx.ret_code would then be set to the actual exit value, and any stdout or stderr the program produced would be logged as well.

@@ -177,6 +188,7 @@ def process(self):
        else:
            proc = self._run_command_init(ctx, callback, callback_args)
            if proc is not None:
                ctx.init_time = time.time()
Contributor

I wonder if we should store the timeout time here, so line 106 only has to compare the current time with the timeout time, instead of having to do an addition before the comparison.
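The suggestion amounts to precomputing an absolute deadline at launch. A minimal sketch, with a stand-in `Ctx` class in place of `SuiteProcContext` and an assumed attribute name `timeout_time`:

```python
import time

CMD_TIMEOUT = 600.0  # seconds; PT10M, the default this PR settles on


class Ctx(object):
    """Stand-in for SuiteProcContext (illustrative only)."""


ctx = Ctx()
# At launch: one addition, stored as an absolute deadline.
ctx.timeout_time = time.time() + CMD_TIMEOUT


def timed_out(ctx, now):
    # Each poll-loop pass is then a single comparison per process.
    return now > ctx.timeout_time
```

The saving per process is tiny, but the poll loop runs every main-loop cycle over every running subprocess, so avoiding the repeated addition is a reasonable tidy-up.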

@@ -159,7 +163,14 @@ def process(self):
        for proc, ctx, callback, callback_args in self.runnings:
            ret_code = proc.poll()
            if ret_code is None:
                runnings.append([proc, ctx, callback, callback_args])
                if time.time() > ctx.init_time + self.cmd_timeout:
Contributor

Worth moving time.time() to before line 163 to reduce the number of system calls.

now = time.time()
for proc, ctx, callback, callback_args in self.runnings:
    ret_code = proc.poll()
    if ret_code is None:
        if now > ctx.init_time + self.cmd_timeout:

@@ -63,6 +64,8 @@ class SuiteProcContext(object):
            Return code of the command.
        .timestamp (str):
            Time string of latest update.
        .proc_pool_timeout (int):
Contributor

Should be .timeout (float)?

Member Author

oops

@matthewrmshin matthewrmshin left a comment

Good enough.

@matthewrmshin matthewrmshin requested review from oliver-sanders and removed request for sadielbartholomew May 14, 2018 13:45
@matthewrmshin
Copy link
Contributor

@oliver-sanders please sanity check.

@oliver-sanders oliver-sanders merged commit 77281b4 into cylc:master May 16, 2018
@matthewrmshin matthewrmshin modified the milestones: soon, next release May 16, 2018
@hjoliver hjoliver deleted the proc-pool-timeout branch May 16, 2018 09:25