Process pool to replace command execution threads. #1012

Merged
merged 13 commits into cylc:119.iso8601-cycling on Jul 21, 2014

Conversation

hjoliver
Member

Replaces #891 (which, pre cylc-6, targeted the wrong branch); closes #596

A process pool to handle execution of job submission, event handler, and poll and kill commands.
Replaces batched command execution in threads.

@hjoliver added this to the cylc-6 milestone on Jul 11, 2014
@hjoliver
Member Author

Tests passing (except restart tests), but not quite ready for review yet.

@hjoliver changed the title from "Process or thread pool to replace command execution threads." to "Process pool to replace command execution threads." on Jul 11, 2014
@matthewrmshin
Contributor

(I am already feeling happy, looking at the net change to the line count.)

@hjoliver
Member Author

@matthewrmshin - please review. Note branch squashed again (last time).

Some points to note:

  • passes my test battery (including the restart tests)
  • very much simpler than the old batched submission thread
  • for small pool sizes this will submit jobs more slowly than the old system (when the server is not loaded up)
  • at shutdown, the worker processes skip already-queued job submission commands, rather than attempting to remove those items from the queue (the internal task queue is not exposed by multiprocessing.Pool).
  • command results are now retrieved by the main process, so we can simply call the associated callback to handle them (on master, results are retrieved in the job submission thread and the main process is signalled via the task messaging queue - an ugly hack if ever there was one); a minimal sketch of this dispatch-and-callback pattern follows this list
  • current pool type (thread pool or process pool) and pool size are configured in the suite.rc - probably the wrong thing to do, but easy to change (I was thinking that very large suites might benefit from controlling the pool size...)
  • to test what happens at shutdown it's convenient to put a 5 sec sleep in the command execution function, otherwise things tend to execute too fast to test, e.g. that queued job submissions cease but queued event handlers complete.
  • under the old system we were not capturing event handler exit status or output (it just went to suite stdout); this branch does capture it.
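
For reference, a minimal sketch of the dispatch-and-callback pattern described above (not the branch's actual code; run_command and on_result are made-up names for illustration): commands are queued with apply_async and results come back to the main process via the callback.

import multiprocessing
from subprocess import PIPE, Popen


def run_command(command):
    """Run a shell command in a worker process and return its result."""
    proc = Popen(command, shell=True, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    return {'command': command, 'ret_code': proc.returncode,
            'out': out, 'err': err}


def on_result(result):
    """Handle a finished command's result back in the main process."""
    print('%(command)s returned %(ret_code)d' % result)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    for cmd in ('echo hello', 'echo world', 'false'):
        # Queue each command for a worker; on_result is invoked in the
        # main process when the worker finishes.
        pool.apply_async(run_command, (cmd,), callback=on_result)
    pool.close()  # accept no more commands; already-queued ones still run
    pool.join()   # block until the workers have exited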

def pool_size(self):
    """Return number of workers."""
    # ACCESSES POOL INTERNAL STATE
    return self.pool._processes
Contributor

This line is giving me a traceback:

Traceback (most recent call last):
  File "/home/matt/cylc.git/lib/cylc/run.py", line 76, in main
    server.configure()
  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 224, in configure
    self.configure_suite()
  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 713, in configure_suite
    self.proc_pool = mp_pool( self.config.cfg['cylc']['shell command execution'])
  File "/home/matt/cylc.git/lib/cylc/mp_pool.py", line 98, in __init__
    self.type, self.pool_size())
  File "/home/matt/cylc.git/lib/cylc/mp_pool.py", line 155, in pool_size
    return self.pool._processes
AttributeError: 'Pool' object has no attribute '_processes'

Member Author

damn, it works for me at Python 2.7.5 ... the perils of accessing the module's internals I guess. Are you at 2.6? I'll try later at 2.6 on another server...

Member Author

That part is pretty non-essential - can you just have pool_size() return some arbitrary number for the moment, and carry on? The only use of multiprocessing.Pool internals that is essential(ish) is in is_dead(), used in scheduler.py.
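
(For context, the is_dead() check amounts to something like the sketch below; not necessarily the branch's exact code, and it peeks at the undocumented Pool._pool worker list, which is precisely the kind of internal access being discussed.)

def is_dead(self):
    """Return True if all worker processes have exited."""
    # ACCESSES POOL INTERNAL STATE: Pool._pool is the (undocumented)
    # list of worker Process objects.
    for worker in self.pool._pool:
        if worker.is_alive():
            return False
    return True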

Contributor

I am on 2.6.6.

Member Author

I'll change pool_size() to return the configured pool size, or call multiprocessing.cpu_count() when the default (None) is given for a process pool.
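
Presumably something along these lines (a sketch of the intended fix, not the actual commit; the 'pool size' config key and the plain-dict config are assumptions):

import multiprocessing


class mp_pool(object):
    """Sketch only: record the pool size at construction time."""

    def __init__(self, pool_config):
        size = pool_config.get('pool size')  # None => one worker per CPU
        self._size = size if size else multiprocessing.cpu_count()
        self.pool = multiprocessing.Pool(processes=size)

    def pool_size(self):
        """Return the number of workers without touching Pool internals."""
        return self._size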

@hjoliver
Member Author

The main reason I needed a bit of access to the guts of the multiprocessing module is that the interface doesn't provide a way to see when all the worker processes have exited ("closed" is just a state that lasts until they've exited), other than calling pool.join(), which blocks and so would prevent use of "stop now" after ordering a normal shutdown.

@matthewrmshin
Contributor

Would Pool.terminate do the right thing? Or would it be overkill?

@matthewrmshin
Contributor

Would Pool.terminate do the right thing? Or would it be overkill?

(Or worse. Not do enough?)

@hjoliver
Member Author

No, Pool.terminate() kills all the worker processes immediately, which is only appropriate for a "stop now" or a KeyboardInterrupt (which, by the way, can be done on this branch).
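
For the record, the distinction being drawn here is standard multiprocessing.Pool behaviour (the shut_down helper below is just illustrative, not cylc code):

import multiprocessing


def shut_down(pool, now=False):
    """Illustrate close() versus terminate() on a worker pool."""
    if now:
        # "stop now" / KeyboardInterrupt: kill the workers immediately,
        # discarding queued (and running) work.
        pool.terminate()
    else:
        # Normal shutdown: accept no new work and let queued items drain.
        pool.close()
    # Either way, join() then blocks until the worker processes exit.
    pool.join()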

@hjoliver
Member Author

New commit gets number of workers without accessing Pool internals.

@hjoliver
Member Author

Another minor change on this branch: I've disabled the printing of all job submission commands to stdout, except with --debug. (note to self: this may require a few minor changes to user guide tutorial transcripts if we stick with it).


def __init__(self, pool_config):
    self.type = pool_config['pool type']
    pool_cls = multiprocessing.Pool
Contributor

I think you have lost an if ...: statement before this.

Member Author

Damn, I don't know how I did that ... another instance of #1022 (comment)
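
(The lost conditional was presumably something along these lines, choosing the pool class from the configured 'pool type'; a guess at the intent, not the branch's actual code.)

import multiprocessing
import multiprocessing.pool


def get_pool_class(pool_type):
    """Return a pool class for the configured 'pool type'."""
    if pool_type == 'thread':
        return multiprocessing.pool.ThreadPool
    return multiprocessing.Pool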

@matthewrmshin
Contributor

I am getting a consistent failure of tests/reload/12-runahead.t. The test expects 4 items in the failed state, but I am only getting 2.

@hjoliver
Member Author

I've just run that test about 20 times, and got two failures with the same symptoms. I'll investigate tomorrow...

@hjoliver
Member Author

I've now run the test 75 times in a loop, and got no more failures.

@matthewrmshin
Contributor

I have finally got a full run of the cylc test battery without failure. I guess tests/reload/12-runahead.t is just one of those unstable tests.

Rose test battery requires metomi/rose#1323.

@matthewrmshin
Contributor

I'll do more tests tomorrow. I want to know how this change impacts on performance. (Hopefully, it will be faster!) We can probably get rid of the lsof stuff in job submit.

@hjoliver
Member Author

I want to know how this change impacts on performance.

For default pool sizes (esp. for a single core machine!) and low system load, I would expect this to submit jobs more slowly than the old system, which submitted batches of 10 jobs at once in parallel. But it should be more robust, and it should perform better under heavy load (esp. if multiple cores are available).

@@ -29,7 +29,7 @@ title = "test all event hooks"

[[prep]]
command scripting = """
printf "%-20s %-8s %s\n" EVENT TASK MESSAGE > $EVNTLOG
printf "%-20s %-8s %s\n" EVENT TASK MESSAGE > {{ EVNTLOG }}
Member Author

This test was relying on the daemon environment being available to background jobs on the suite host - which is bad form anyway, but not even possible when using a process pool.

Also removes extra blank lines before scheduler methods.
@matthewrmshin
Contributor

Some tests in tests/restart/11-bad-state-dump.t are failing. The tests are looking for Suite shutting down.*ERROR patterns in the log. Looks like the suite is no longer printing these messages to the log?

@matthewrmshin
Contributor

The lsof logic does appear to throttle the suite. My 3000-task suite runs happily with the lsof logic in place, but it overwhelms atd on my desktop when I remove the lsof logic on this branch. To get the suite to run, I actually have to add a queue (members=root and limit=100) to throttle it.

@matthewrmshin
Contributor

When the suite gets overwhelmed, I start getting messages like:

2014-07-18T10:52:15+01 WARNING - [t2310.20130102T1200Z] -[Errno 11] Resource temporarily unavailable

And cylc shutdown (and other Pyro-based commands) will fail to connect to the suite.

@matthewrmshin
Contributor

One thing I notice is that there is still something like 2014-07-18T12:31:29+01 INFO - Thread-4 exit (Request Handling) in the suite log, but I don't think we have threads 1 to 3 any more?

@matthewrmshin
Contributor

Another thing I noticed is that after the suite has received a shut down command, it continues to print out messages about jobs being submitted (although the workers appear to correctly reject them). This may confuse users.

Overall, however, I think the new logic is much more efficient (and even more so if we remove the lsof logic).

@hjoliver
Member Author

One thing I notice is that there is still something like 2014-07-18T12:31:29+01 INFO - Thread-4 exit (Request Handling) in the suite log, but I don't think we have threads 1 to 3 any more?

The rundb DAO starts a thread, and multiprocessing.Pool starts two threads. I've turned off the request handling thread ID print, since these other threads do not identify themselves at start-up and shutdown.
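
(A quick way to check which threads are actually running, purely as a diagnostic snippet rather than anything on the branch:)

import threading

# List every thread in the current process, e.g. MainThread, the rundb
# DAO thread, and the handler threads started by multiprocessing.Pool.
for thread in threading.enumerate():
    print('%s alive=%s' % (thread.name, thread.is_alive()))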

if reason:
    msg += ' (' + reason + ')'
print msg
if getattr(self, "log", None) is not None:
    self.log.info( msg )
Member Author

Oops, accidental deletion.

@hjoliver
Member Author

Some tests in tests/restart/11-bad-state-dump.t are failing.

Fixed - #1012 (diff)

@hjoliver
Member Author

Do you think we can remove the lsof logic? I don't really understand why it was necessary in the first place anyway: jobfiles get explicitly closed, so they should be closed if the Python documentation is to be believed (but clearly your old "bad interpreter" errors show otherwise). If it is still necessary, we might be able to move the job file write into the new process pool so that it no longer holds up the main process... in fact that might be a good thing to do anyway.
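
(If we went that way, the worker function might look roughly like this; purely a sketch of the idea with made-up names, not anything on the branch.)

import os
import stat
from subprocess import PIPE, Popen


def write_and_submit(job_file_path, job_script, submit_command):
    """Write the job script and submit it, all inside a worker process."""
    handle = open(job_file_path, 'w')
    handle.write(job_script)
    handle.close()  # explicitly closed before execution
    os.chmod(job_file_path, stat.S_IRWXU)
    proc = Popen(submit_command + [job_file_path], stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err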

@matthewrmshin
Contributor

The lsof logic is one of those things that should never be necessary. The behaviour is, however, clearly demonstrated in the gists I posted at the time I added the logic. In the absence of the logic, the "text file busy" error would show its ugly face every now and then in one of our big suites, which was simply not acceptable.

@hjoliver
Member Author

Ah, yes - I just revisited #584 to remind myself. Your test programs suggested that (a) we should not need the lsof logic with a process pool; and (b) with a thread pool, moving the file-write into the thread that executes the script, as per my suggestion above, will not help (because your threaded test program writes and executes inside the threads)...(although it could still speed up the main thread or process a bit).

So, is there any good reason to retain the current choice of thread or process pool on this branch, or shall I just delete the thread pool capability (and the lsof code with it)?

@matthewrmshin
Contributor

So, is there any good reason to retain the current choice of thread or process pool on this branch, or shall I just delete the thread pool capability (and the lsof code with it)?

Yes, it is a good idea to simplify things.

@hjoliver
Member Author

after the suite has received a shut down command, it continues to print out messages about jobs being submitted (although the workers appear to correctly reject them). This may confuse users.

I've changed the 'submitting now' message (which has not been accurate since before batched job submission, really) to 'incrementing submit number', which fits with its use in rundb.py.

@hjoliver
Member Author

Test battery passes. I had to change several test suites after bash got upgraded (to a buggy version!) on my box; and the UTC mode clock-trigger test has never worked on non-UTC system clocks, until now.

@hjoliver
Member Author

@matthewrmshin - in my opinion we are good to go with this now, but do you want to do more testing?

@hjoliver
Member Author

Job submissions sure go off fast without the lsof logic in jobfile.py.

@matthewrmshin
Contributor

(I am now doing some final testing.)

@matthewrmshin
Contributor

(Restart tests 04 and 09 are still unstable, but I don't think the instability has anything to do with this change.)

matthewrmshin added a commit that referenced this pull request Jul 21, 2014
Process pool to replace command execution threads.
@matthewrmshin merged commit de8a2e8 into cylc:119.iso8601-cycling on Jul 21, 2014
@hjoliver deleted the procpool branch on July 21, 2014 10:14
@@ -117,7 +118,7 @@ force_restart|2013092300|1|1|succeeded
force_restart|2013092306|1|1|succeeded
force_restart|2013092312|0|1|held
output_states|2013092300|1|1|succeeded
-output_states|2013092306|1|1|running
+output_states|2013092306|2|1|running
Contributor

I'm trying to incorporate this into another branch - but I don't understand this change.

Member Author

It does look odd - I'll attempt to understand it later today...
