
Kill proc pool commands that take too long. #2659

Merged
merged 6 commits into from
May 16, 2018

Conversation

hjoliver
Member

Hung or long-running event handlers (and other external commands) can cause a suite to stall with tasks in the 'ready' state (i.e. with job submission commands queued but unable to execute because the maximum number of subprocesses is already in use).
(I have some evidence that this has happened at BoM: a suite stalled with 'ready' tasks after reportedly spawning job polling commands that hung for some reason).

This kills subprocesses that are still running after 10 seconds. It was easier to do than I expected. @cylc/core - thoughts? (Should we do this; should it be configurable; is this approach OK?)
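The mechanism described above can be sketched as follows. This is an illustrative stand-alone example, not Cylc's actual code: the names, the 1-second timeout (shortened from the PR's 10 seconds for brevity), and the polling interval are all assumptions.

```python
import os
import signal
import subprocess
import time

TIMEOUT = 1.0  # illustrative; the PR initially hard-coded 10 seconds

# Start the command in its own process group (preexec_fn=os.setsid) so
# that a later killpg() also takes out any children it spawned.
proc = subprocess.Popen(['sleep', '60'], preexec_fn=os.setsid)
start = time.time()

# Poll until the command exits or the timeout expires.
while proc.poll() is None:
    if time.time() - start > TIMEOUT:
        os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
        break
    time.sleep(0.05)

ret_code = proc.wait()  # negative: terminated by a signal
```

Because `os.setsid` makes the child its own process-group leader, `proc.pid` doubles as the process-group ID passed to `killpg`.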

@hjoliver hjoliver self-assigned this May 10, 2018
@matthewrmshin matthewrmshin added this to the soon milestone May 10, 2018
@matthewrmshin
Contributor

#2315 is the master issue. I don't think we can have a blanket 10s timeout, because some external commands may take a while to get through (e.g. job log retrieval, a badly designed event handler, a slow network, submitting jobs to a busy batch system, etc.).

The most correct approach is to expose the issue to users, and allow them to tell the suite to kill these commands via the client-server API, the CLI and/or the GUI.

@ColemanTom
Contributor

I'll throw out a couple of other ideas.

  1. You could have an option in nodes that people could set. By default, it does nothing, but people could set a timeout if they want to.
  2. Perhaps cylc could have a limit on the number of active processes, beyond which it refuses to spawn new tasks, polling periodically to see whether it can spawn more. This could be definable in the global settings as a percentage of the maximum (defaulting to, say, no more than 90% of the maximum processes running). Before spawning a task, cylc would check how close it is to the limit and hold off if it is there. By default this would be off, but the option would exist for the system admin who sets cylc up.

@hjoliver
Member Author

hjoliver commented May 11, 2018

The most correct approach is to expose the issue to users, and allow them to tell the suite to kill these commands via the client-server API, the CLI and/or the GUI.

Yes, it would be good for users to have a "window" (via the CLI and/or GUI) into which external command subprocesses are currently running, showing the command types (job submit, event handler, etc.) and command lines, along with how long they have been running, plus an option to kill them. We should put up an issue for this (or add it to #2315).

However,

  1. that may be a non-trivial job and we do have an immediate problem: if the process pool locks up with hung processes, the only solution right now is to kill the suite server program and do a restart.
  2. I don't agree that this is the most correct approach for hung processes, which evidently do happen (and it's a distinct possibility with user-supplied event handlers). If that occurs in the dead of night, it would be best to eliminate the problem automatically, presumably by killing the process on a timeout. Otherwise, the suite stalls until morning.

I suggest (as per Tom's suggestion 1 above) we allow my kill on timeout as a global config option, and just document that if you use a timeout it had better be long enough to avoid killing important processes when the system loads up a bit. Does that seem reasonable?

@hjoliver
Member Author

@ColemanTom - I don't quite understand your second point, but you may be describing more or less what Cylc does already: it manages the number of spawned subprocesses to be <= N, where N is configurable but defaults to the number of cores on the suite host. Hence the problem here, if all N slots are held by hung processes.
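The pool behaviour described here (at most N concurrent subprocesses, N defaulting to the host's core count) can be sketched in a few lines. This is a simplified illustration under assumed names, not Cylc's implementation:

```python
import multiprocessing
import subprocess

# Assumed default, per the comment above: pool size = number of cores.
MAX_PROCS = multiprocessing.cpu_count()

queued = [['true'] for _ in range(MAX_PROCS * 3)]  # commands awaiting a slot
running = []


def process():
    """One scheduling pass: reap finished procs, then top up the pool."""
    global running
    # Drop finished processes to free their slots.
    running = [p for p in running if p.poll() is None]
    # Fill free slots from the queue, never exceeding the limit.
    while queued and len(running) < MAX_PROCS:
        running.append(subprocess.Popen(queued.pop(0)))
```

If every slot is held by a hung process, `process()` can never start anything new, which is exactly the stall this PR addresses.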

@ColemanTom
Contributor

Yes, I wasn't sure how to word 2, but that is what I meant.

@matthewrmshin
Contributor

Hi @hjoliver I can now see where you are coming from. In which case, I have no objection to a default time out for all managed subprocesses by the suite server program. This should fix the original intention of #2315.

Quick brain dump:

  • However, the default value should be somewhat higher than PT10S to avoid surprises. I think we should pick a value around 5-15 minutes? (That should handle the worst rsync job log retrieval commands.)
  • I think we should still have an API/CLI/GUI functionality to list and/or send signal to managed subprocesses.
  • Finally, we are still lacking an API functionality to tell the server program to relaunch selected task event handlers.

@hjoliver
Member Author

hjoliver commented May 14, 2018

@matthewrmshin - made configurable and upped the timeout to PT10M, as suggested. Tests added. Could you post another issue for your ideas on an API to give users more control over the process pool? (I think that goes well beyond this PR, so long as the kill timeout is sufficiently long.)

@hjoliver
Member Author

(Just to note, the motivation for this PR was to replace - and generalize to all pool processes - the function timeout mechanism that I had previously added to #2423, which got nuked by #2590.)

@matthewrmshin
Contributor

Tests OK in my environment.

@matthewrmshin matthewrmshin left a comment

Some quick comments.

@@ -159,7 +163,14 @@ def process(self):
        for proc, ctx, callback, callback_args in self.runnings:
            ret_code = proc.poll()
            if ret_code is None:
                runnings.append([proc, ctx, callback, callback_args])
                if time.time() > ctx.init_time + self.cmd_timeout:
                    os.killpg(proc.pid, SIGKILL)
Contributor

Should protect this with a try ... except OSError block, just in case the process exits normally between line 165 and this line.
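The guarded kill the review asks for might look like the sketch below: the helper name is hypothetical, but the race it tolerates is real (the process can exit and be reaped between `poll()` and `killpg()`, in which case `killpg` raises `OSError` with `ESRCH`).

```python
import errno
import os
import signal
import subprocess


def kill_proc_group(proc):
    """Kill proc's process group, tolerating a vanished process.

    If the process exited and was reaped between our poll() and this
    call, os.killpg raises OSError with errno ESRCH; swallow only that.
    """
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except OSError as exc:
        if exc.errno != errno.ESRCH:
            raise
```

In Python 3 the `ESRCH` case surfaces as `ProcessLookupError`, a subclass of `OSError`, so the same guard covers both.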

                    ctx.ret_code = 1
                    ctx.err = "killed on timeout (%s)" % self.cmd_timeout
                    self._run_command_exit(ctx, callback, callback_args)
                    proc.wait()
Contributor

Better this way?

os.killpg(proc.pid, SIGKILL)
ctx.out, ctx.err = proc.communicate()
ctx.ret_code = proc.wait()
ctx.err += '\nkilled on timeout (%s)' % self.cmd_timeout
self._run_command_exit(ctx, callback, callback_args)

ctx.ret_code would then be set to the actual exit value, and any stdout or stderr the program produced would be logged as well.

@@ -177,6 +188,7 @@ def process(self):
        else:
            proc = self._run_command_init(ctx, callback, callback_args)
            if proc is not None:
                ctx.init_time = time.time()
Contributor

I wonder if we should store the timeout time here, so line 106 only has to compare the current time with the timeout time, instead of having to do an addition before the comparison.
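The suggestion amounts to precomputing an absolute deadline at launch. A minimal sketch, with a stand-in `Ctx` class in place of `SuiteProcContext` and an assumed attribute name `timeout_time`:

```python
import time

CMD_TIMEOUT = 600.0  # seconds; PT10M, the default this PR settles on


class Ctx(object):
    """Stand-in for SuiteProcContext (illustrative only)."""


ctx = Ctx()
# At launch: one addition, stored as an absolute deadline.
ctx.timeout_time = time.time() + CMD_TIMEOUT


def timed_out(ctx, now):
    # Each poll-loop pass is then a single comparison per process.
    return now > ctx.timeout_time
```

The saving per process is tiny, but the poll loop runs every main-loop cycle over every running subprocess, so avoiding the repeated addition is a reasonable tidy-up.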

@@ -159,7 +163,14 @@ def process(self):
        for proc, ctx, callback, callback_args in self.runnings:
            ret_code = proc.poll()
            if ret_code is None:
                runnings.append([proc, ctx, callback, callback_args])
                if time.time() > ctx.init_time + self.cmd_timeout:
Contributor

Worth moving time.time() to before line 163 to reduce the number of system calls.

now = time.time()
for proc, ctx, callback, callback_args in self.runnings:
    ret_code = proc.poll()
    if ret_code is None:
        if now > ctx.init_time + self.cmd_timeout:

@@ -63,6 +64,8 @@ class SuiteProcContext(object):
            Return code of the command.
        .timestamp (str):
            Time string of latest update.
        .proc_pool_timeout (int):
Contributor

Should be .timeout (float)?

Member Author

oops

@matthewrmshin matthewrmshin left a comment

Good enough.

@matthewrmshin matthewrmshin requested review from oliver-sanders and removed request for sadielbartholomew May 14, 2018 13:45
@matthewrmshin
Copy link
Contributor

@oliver-sanders please sanity check.

@oliver-sanders oliver-sanders merged commit 77281b4 into cylc:master May 16, 2018
@matthewrmshin matthewrmshin modified the milestones: soon, next release May 16, 2018
@hjoliver hjoliver deleted the proc-pool-timeout branch May 16, 2018 09:25