Kill proc pool commands that take too long. #2659
Conversation
#2315 is the master issue. I don't think we can have a blanket 10s timeout, because some external commands may take a while to get through (e.g. job log retrieval, a badly designed event handler, a slow network, submitting jobs on a busy batch system, etc.). The most correct approach is to expose the issue to users and allow them to tell the suite to kill these commands via the client-server API, the CLI and/or the GUI.
I'll throw out a couple of other ideas.
Yes, it would be good for users to have a "window" (via the CLI and/or GUI) into what external command sub-processes are currently running, showing the command types (job submit, event handler, etc.) and command lines, along with how long they have been running and an option to kill them. We should put up an issue for this [or add it to #2315]. However,
I suggest (as per Tom's suggestion 1 above) we allow my kill-on-timeout as a global config option, and just document that if you use a timeout it had better be long enough to avoid killing important processes when the system loads up a bit. Does that seem reasonable?
@ColemanTom - I don't quite understand your second point, but you may be describing more or less what Cylc does already: it manages the number of spawned subprocesses to be <= N, where N is configurable but defaults to the number of cores on the suite host. Hence the problem here, if all N slots are held by hung processes.
Yes, I wasn't sure how to word 2, but that is what I meant.
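For context, a toy illustration of why hung commands starve the pool (the size limit and commands here are made up for the demo; this is not Cylc's scheduler code):

import subprocess

MAX_PROCS = 4  # cf. the configurable pool size, default = number of cores

queued = [['sleep', '3600']] * MAX_PROCS + [['echo', 'job-submit']]
running = []

# Fill the pool up to the limit. Once every slot is held by a hung
# 'sleep 3600', the pending 'echo job-submit' can never be launched.
while queued and len(running) < MAX_PROCS:
    running.append(subprocess.Popen(queued.pop(0)))

print('%d command(s) stuck in the queue behind hung processes' % len(queued))

for proc in running:
    proc.kill()  # tidy up the demo processes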
Hi @hjoliver, I can now see where you are coming from. In that case, I have no objection to a default timeout for all subprocesses managed by the suite server program. This should fix the original intention of #2315. Quick brain dump:
* invoke callback after kill
* global config timeout, PT10M default
@matthewrmshin - made configurable and upped the timeout to PT10M, as suggested. Tests added. Can you post another issue for your ideas on an API to give users more control over the process pool (which I think goes well beyond this PR, so long as the kill timeout is sufficiently long)?
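For reference, the kind of site/user global configuration this implies might look like the following (the item name and placement shown here are assumptions based on this thread, not necessarily the final syntax adopted by the PR):

# global.rc (illustrative only)
process pool size = 4         # existing: max number of concurrent pool subprocesses
process pool timeout = PT10M  # new: kill pool commands still running after this ISO 8601 duration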
Tests OK in my environment.
Some quick comments.
lib/cylc/mp_pool.py
@@ -159,7 +163,14 @@ def process(self):
         for proc, ctx, callback, callback_args in self.runnings:
             ret_code = proc.poll()
             if ret_code is None:
-                runnings.append([proc, ctx, callback, callback_args])
+                if time.time() > ctx.init_time + self.cmd_timeout:
+                    os.killpg(proc.pid, SIGKILL)
Should protect this with a try ... except OSError block, just in case the process exits normally between line 165 and this line.
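A minimal sketch of that guard (the helper name is hypothetical; proc is the subprocess.Popen object from the loop above):

import os
from signal import SIGKILL

def kill_proc_group(proc):
    """Kill a pool command's process group, tolerating an already-dead process."""
    try:
        os.killpg(proc.pid, SIGKILL)
    except OSError:
        pass  # the process exited normally just before the kill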
lib/cylc/mp_pool.py
+                    ctx.ret_code = 1
+                    ctx.err = "killed on timeout (%s)" % self.cmd_timeout
+                    self._run_command_exit(ctx, callback, callback_args)
+                    proc.wait()
Better this way?
os.killpg(proc.pid, SIGKILL)
ctx.ret_code = proc.wait()
ctx.out, ctx.err = proc.communicate()
ctx.err += '\nkilled on timeout (%s)' % self.cmd_timeout
self._run_command_exit(ctx, callback, callback_args)
ctx.ret_code should then be set to the actual exit value. If the program has produced some stderr and stdout, they will be logged as well.
lib/cylc/mp_pool.py
@@ -177,6 +188,7 @@ def process(self):
             else:
                 proc = self._run_command_init(ctx, callback, callback_args)
                 if proc is not None:
+                    ctx.init_time = time.time()
I wonder if we should store the timeout time here, so that line 106 only has to compare the current time with the timeout time, instead of having to do an addition before the comparison.
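A small self-contained illustration of that variant (the _Ctx class and timeout_time attribute are stand-ins for this sketch only):

import time

class _Ctx(object):
    """Stand-in for SuiteProcContext, just to illustrate the idea."""
    pass

cmd_timeout = 600.0  # e.g. PT10M

# At submission time, store the absolute deadline once...
ctx = _Ctx()
ctx.timeout_time = time.time() + cmd_timeout

# ...so each pass of the polling loop only needs a comparison,
# with no addition per running command:
if time.time() > ctx.timeout_time:
    print('would kill the command now')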
lib/cylc/mp_pool.py
@@ -159,7 +163,14 @@ def process(self):
         for proc, ctx, callback, callback_args in self.runnings:
             ret_code = proc.poll()
             if ret_code is None:
-                runnings.append([proc, ctx, callback, callback_args])
+                if time.time() > ctx.init_time + self.cmd_timeout:
Worth moving time.time() to before line 163 to reduce the number of system calls.
now = time.time()
for proc, ctx, callback, callback_args in self.runnings:
    ret_code = proc.poll()
    if ret_code is None:
        if now > ctx.init_time + self.cmd_timeout:
lib/cylc/mp_pool.py
@@ -63,6 +64,8 @@ class SuiteProcContext(object):
             Return code of the command.
         .timestamp (str):
             Time string of latest update.
+        .proc_pool_timeout (int):
Should be .timeout (float)?
oops
Good enough.
@oliver-sanders please sanity check.
Hung or long-running event handlers (etc.) can cause a suite to stall with tasks in the 'ready' state (i.e. with job submit commands queued but unable to execute because the max number of subprocesses is used up).
(I have some evidence that this has happened at BoM: a suite stalled with 'ready' tasks after reportedly spawning job polling commands that hung for some reason).
This kills subprocesses that are still running after 10 seconds. Easier to do than I expected. @cylc/core - thoughts? (should we do this; should it be configurable; is this approach OK?)
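A minimal standalone sketch of the mechanism being proposed (the function and parameter names are illustrative, not the PR's actual code, which works inside the suite process pool):

import os
import subprocess
import time
from signal import SIGKILL

def run_with_timeout(cmd, timeout=10.0, poll_interval=0.1):
    """Run cmd, killing its whole process group if it outlives timeout seconds."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        preexec_fn=os.setpgrp)  # new process group, so child processes die too
    deadline = time.time() + timeout
    while proc.poll() is None:
        if time.time() > deadline:
            try:
                os.killpg(proc.pid, SIGKILL)
            except OSError:
                pass  # it may have exited just now
            proc.wait()
            return 1, '', 'killed on timeout (%s)' % timeout
        time.sleep(poll_interval)
    out, err = proc.communicate()
    return proc.returncode, out, err

# e.g. run_with_timeout(['sleep', '60'], timeout=2.0)
#   -> (1, '', 'killed on timeout (2.0)')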