Process pool to replace command execution threads. #1012

Merged
merged 13 commits into cylc:119.iso8601-cycling on Jul 21, 2014

Conversation

hjoliver
Member

Replaces #891 (which, pre cylc-6, targeted the wrong branch); closes #596

A process pool to handle execution of job submission, event handler, and poll and kill commands.
Replaces batched command execution in threads.

@hjoliver added this to the cylc-6 milestone on Jul 11, 2014
@hjoliver
Member Author

Tests passing (except restart tests), but not quite ready for review yet.

@hjoliver changed the title from "Process or thread pool to replace command execution threads." to "Process pool to replace command execution threads." on Jul 11, 2014
@matthewrmshin
Contributor

(I am already feeling happy, looking at the net change to the line count.)

@hjoliver
Member Author

@matthewrmshin - please review. Note branch squashed again (last time).

Some points to note:

  • passes my test battery (including the restart tests)
  • very much simpler than the old batched submission thread
  • for small pool sizes this will submit jobs more slowly than the old system (when the server is not loaded up)
  • at shutdown, the worker processes skip already-queued job submission commands, rather than attempting to remove those items from the queue (the internal task queue is not exposed by multiprocessing.Pool).
  • command results are now retrieved by the main process, so we can simply call the associated callback to handle them (on master, results are retrieved in the job submission thread and the main process is signalled via the task messaging queue - an ugly hack if ever there was one); a minimal sketch of this dispatch-and-callback pattern follows this list
  • current pool type (thread pool or process pool) and pool size are configured in the suite.rc - probably the wrong thing to do, but easy to change (I was thinking that very large suites might benefit from controlling the pool size...)
  • to test what happens at shutdown it's convenient to put a 5 sec sleep in the command execution function, otherwise things tend to execute too fast to test, e.g. that queued job submissions cease but queued event handlers complete.
  • under the old system we were not capturing event handler exit status or output (it just went to suite stdout); this branch does capture it.
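
For reference, a minimal sketch of the dispatch-and-callback pattern described above (not the branch's actual code; run_command and on_result are made-up names for illustration): commands are queued with apply_async and results come back to the main process via the callback.

import multiprocessing
from subprocess import PIPE, Popen


def run_command(command):
    """Run a shell command in a worker process and return its result."""
    proc = Popen(command, shell=True, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    return {'command': command, 'ret_code': proc.returncode,
            'out': out, 'err': err}


def on_result(result):
    """Handle a finished command's result back in the main process."""
    print('%(command)s returned %(ret_code)d' % result)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    for cmd in ('echo hello', 'echo world', 'false'):
        # Queue each command for a worker; on_result is invoked in the
        # main process when the worker finishes.
        pool.apply_async(run_command, (cmd,), callback=on_result)
    pool.close()  # accept no more commands; already-queued ones still run
    pool.join()   # block until the workers have exited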

def pool_size(self):
    """Return number of workers."""
    # ACCESSES POOL INTERNAL STATE
    return self.pool._processes
Contributor

This line is giving me a traceback:

Traceback (most recent call last):
  File "/home/matt/cylc.git/lib/cylc/run.py", line 76, in main
    server.configure()
  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 224, in configure
    self.configure_suite()
  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 713, in configure_suite
    self.proc_pool = mp_pool( self.config.cfg['cylc']['shell command execution'])
  File "/home/matt/cylc.git/lib/cylc/mp_pool.py", line 98, in __init__
    self.type, self.pool_size())
  File "/home/matt/cylc.git/lib/cylc/mp_pool.py", line 155, in pool_size
    return self.pool._processes
AttributeError: 'Pool' object has no attribute '_processes'

Member Author

damn, it works for me at Python 2.7.5 ... the perils of accessing the module's internals I guess. Are you at 2.6? I'll try later at 2.6 on another server...

Member Author

That part is pretty non-essential - can you just have pool_size() return some arbitrary number for the moment, and carry on? The only use of multiprocessing.Pool internals that is essential(ish) is in is_dead(), used in scheduler.py.
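
(For context, the is_dead() check amounts to something like the sketch below; not necessarily the branch's exact code, and it peeks at the undocumented Pool._pool worker list, which is precisely the kind of internal access being discussed.)

def is_dead(self):
    """Return True if all worker processes have exited."""
    # ACCESSES POOL INTERNAL STATE: Pool._pool is the (undocumented)
    # list of worker Process objects.
    for worker in self.pool._pool:
        if worker.is_alive():
            return False
    return True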

Contributor

I am on 2.6.6.

Member Author

I'll change pool_size() to return the configured pool size, or call multiprocessing.cpu_count() when the default (None) is given for a process pool.
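
Presumably something along these lines (a sketch of the intended fix, not the actual commit; the 'pool size' config key and the plain-dict config are assumptions):

import multiprocessing


class mp_pool(object):
    """Sketch only: record the pool size at construction time."""

    def __init__(self, pool_config):
        size = pool_config.get('pool size')  # None => one worker per CPU
        self._size = size if size else multiprocessing.cpu_count()
        self.pool = multiprocessing.Pool(processes=size)

    def pool_size(self):
        """Return the number of workers without touching Pool internals."""
        return self._size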

@hjoliver
Member Author

The main reason I needed a bit of access to the guts of the multiprocessing module is that the interface doesn't provide a way to see when all the worker processes have exited ("closed" is just a state that lasts until they've exited), other than calling pool.join(), which blocks and so would prevent use of "stop now" after ordering a normal shutdown.

@matthewrmshin
Contributor

Would Pool.terminate do the right thing? Or would it be overkill?

@matthewrmshin
Contributor

Would Pool.terminate do the right thing? Or would it be overkill?

(Or worse. Not do enough?)

@hjoliver
Member Author

No, Pool.terminate() kills all the worker processes immediately, which is only appropriate for a "stop now" or a KeyboardInterrupt (which, by the way, can be done on this branch).
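
For the record, the distinction being drawn here is standard multiprocessing.Pool behaviour (the shut_down helper below is just illustrative, not cylc code):

import multiprocessing


def shut_down(pool, now=False):
    """Illustrate close() versus terminate() on a worker pool."""
    if now:
        # "stop now" / KeyboardInterrupt: kill the workers immediately,
        # discarding queued (and running) work.
        pool.terminate()
    else:
        # Normal shutdown: accept no new work and let queued items drain.
        pool.close()
    # Either way, join() then blocks until the worker processes exit.
    pool.join()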

@hjoliver
Member Author

New commit gets number of workers without accessing Pool internals.

@hjoliver
Member Author

Another minor change on this branch: I've disabled the printing of all job submission commands to stdout, except with --debug. (note to self: this may require a few minor changes to user guide tutorial transcripts if we stick with it).


def __init__(self, pool_config):
    self.type = pool_config['pool type']
    pool_cls = multiprocessing.Pool
Contributor

I think you have lost an if ...: statement before this.

Member Author

Damn, I don't know how I did that ... another instance of #1022 (comment)
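
(The lost conditional was presumably something along these lines, choosing the pool class from the configured 'pool type'; a guess at the intent, not the branch's actual code.)

import multiprocessing
import multiprocessing.pool


def get_pool_class(pool_type):
    """Return a pool class for the configured 'pool type'."""
    if pool_type == 'thread':
        return multiprocessing.pool.ThreadPool
    return multiprocessing.Pool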

@matthewrmshin
Contributor

I am getting a consistent failure of tests/reload/12-runahead.t. The test expects 4 items in the failed state, but I am only getting 2.

@hjoliver
Member Author

I've just run that test about 20 times, and got two failures with the same symptoms. I'll investigate tomorrow...

@hjoliver
Member Author

I've now run the test 75 times in a loop, and got no more failures.

@matthewrmshin
Contributor

I have finally got a full run of the cylc test battery without failure. I guess tests/reload/12-runahead.t is just one of those unstable tests.

Rose test battery requires metomi/rose#1323.

@matthewrmshin
Contributor

I'll do more tests tomorrow. I want to know how this change impacts on performance. (Hopefully, it will be faster!) We can probably get rid of the lsof stuff in job submit.

@hjoliver
Member Author

I want to know how this change impacts on performance.

For default pool sizes (esp. for a single core machine!) and low system load, I would expect this to submit jobs more slowly than the old system, which submitted batches of 10 jobs at once in parallel. But it should be more robust, and it should perform better under heavy load (esp. if multiple cores are available).

@@ -29,7 +29,7 @@ title = "test all event hooks"

[[prep]]
command scripting = """
printf "%-20s %-8s %s\n" EVENT TASK MESSAGE > $EVNTLOG
printf "%-20s %-8s %s\n" EVENT TASK MESSAGE > {{ EVNTLOG }}
Member Author

This test was relying on the daemon environment being available to background jobs on the suite host - which is bad form anyway, but not even possible when using a process pool.

Also removes extra blank lines before scheduler methods.
@matthewrmshin
Contributor

Some tests in tests/restart/11-bad-state-dump.t are failing. The tests are looking for Suite shutting down.*ERROR patterns in the log. Looks like the suite is no longer printing these messages to the log?

@matthewrmshin
Contributor

The lsof logic does appear to throttle the suite. My 3000-task suite runs happily with the lsof logic in place, but it overwhelms atd on my desktop when I remove the lsof logic on this branch. To get the suite to run, I actually have to add a queue (members=root and limit=100) to throttle it.

@matthewrmshin
Contributor

When the suite gets overwhelmed, I start getting messages like:

2014-07-18T10:52:15+01 WARNING - [t2310.20130102T1200Z] -[Errno 11] Resource temporarily unavailable

And cylc shutdown (and other Pyro-based commands) will fail to connect to the suite.

@matthewrmshin
Contributor

One thing I notice is that there is still something like 2014-07-18T12:31:29+01 INFO - Thread-4 exit (Request Handling) in the suite log, but I don't think we have threads 1 to 3 any more?

@matthewrmshin
Contributor

Another thing I noticed is that after the suite has received a shut down command, it continues to print out messages about jobs being submitted (although the workers appear to correctly reject them). This may confuse users.

Overall, however, I think the new logic is much more efficient (and even more so if we remove the lsof logic).

@hjoliver
Member Author

One thing I notice is that there is still something like 2014-07-18T12:31:29+01 INFO - Thread-4 exit (Request Handling) in the suite log, but I don't think we have threads 1 to 3 any more?

The rundb DAO starts a thread, and multiprocessing.Pool starts two threads. I've turned off the request handling thread ID print, since these other threads do not identify themselves at start-up and shutdown.
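
(A quick way to check which threads are actually running, purely as a diagnostic snippet rather than anything on the branch:)

import threading

# List every thread in the current process, e.g. MainThread, the rundb
# DAO thread, and the handler threads started by multiprocessing.Pool.
for thread in threading.enumerate():
    print('%s alive=%s' % (thread.name, thread.is_alive()))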

if reason:
    msg += ' (' + reason + ')'
print msg
if getattr(self, "log", None) is not None:
    self.log.info( msg )
Member Author

Oops, accidental deletion.

@hjoliver
Member Author

Some tests in tests/restart/11-bad-state-dump.t are failing.

Fixed - #1012 (diff)

@hjoliver
Member Author

Do you think we can remove the lsof logic? I don't really understand why it was necessary in the first place anyway: jobfiles get explicitly closed, so they should be closed if the Python documentation is to be believed (but clearly your old "bad interpreter" errors show otherwise). If it is still necessary, we might be able to move the job file write into the new process pool so that it no longer holds up the main process... in fact that might be a good thing to do anyway.
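
(If we went that way, the worker function might look roughly like this; purely a sketch of the idea with made-up names, not anything on the branch.)

import os
import stat
from subprocess import PIPE, Popen


def write_and_submit(job_file_path, job_script, submit_command):
    """Write the job script and submit it, all inside a worker process."""
    handle = open(job_file_path, 'w')
    handle.write(job_script)
    handle.close()  # explicitly closed before execution
    os.chmod(job_file_path, stat.S_IRWXU)
    proc = Popen(submit_command + [job_file_path], stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err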

@matthewrmshin
Contributor

The lsof logic is one of those things that should never be necessary. The behaviour is, however, clearly demonstrated in the gists I posted at the time I added the logic. In the absence of the logic, the "text file busy" error would show its ugly face every now and then in one of our big suites, which was simply not acceptable.

@hjoliver
Member Author

Ah, yes - I just revisited #584 to remind myself. Your test programs suggested that (a) we should not need the lsof logic with a process pool; and (b) with a thread pool, moving the file-write into the thread that executes the script, as per my suggestion above, will not help (because your threaded test program writes and executes inside the threads)...(although it could still speed up the main thread or process a bit).

So, is there any good reason to retain the current choice of thread or process pool on this branch, or shall I just delete the thread pool capability (and the lsof code with it)?

@matthewrmshin
Contributor

So, is there any good reason to retain the current choice of thread or process pool on this branch, or shall I just delete the thread pool capability (and the lsof code with it)?

Yes, it is a good idea to simplify things.

@hjoliver
Member Author

after the suite has received a shut down command, it continues to print out messages about jobs being submitted (although the workers appear to correctly reject them). This may confuse users.

I've changed the 'submitting now' message (which has not been accurate since before batched job submission, really) to 'incrementing submit number', which fits with its use in rundb.py.

@hjoliver
Member Author

Test battery passes. I had to change several test suites after bash got upgraded (to a buggy version!) on my box; and the UTC mode clock-trigger test has never worked on non-UTC system clocks, until now.

@hjoliver
Member Author

@matthewrmshin - in my opinion we are good to go with this now, but do you want to do more testing?

@hjoliver
Member Author

Job submissions sure go off fast without the lsof logic in jobfile.py.

@matthewrmshin
Contributor

(I am now doing some final testing.)

@matthewrmshin
Contributor

(Restart tests 04 and 09 are still unstable, but I don't think the instability has anything to do with this change.)

matthewrmshin added a commit that referenced this pull request Jul 21, 2014
Process pool to replace command execution threads.
@matthewrmshin merged commit de8a2e8 into cylc:119.iso8601-cycling on Jul 21, 2014
@hjoliver deleted the procpool branch on July 21, 2014 10:14
@@ -117,7 +118,7 @@ force_restart|2013092300|1|1|succeeded
force_restart|2013092306|1|1|succeeded
force_restart|2013092312|0|1|held
output_states|2013092300|1|1|succeeded
-output_states|2013092306|1|1|running
+output_states|2013092306|2|1|running
Contributor

I'm trying to incorporate this into another branch - but I don't understand this change.

Member Author

It does look odd - I'll attempt to understand it later today...
