Process pool to replace command execution threads. #1012
Conversation
Tests passing (except restart tests), but not quite ready for review yet.
(I am already feeling happy, looking at the net change to the line count.)
@matthewrmshin - please review. Note branch squashed again (last time). Some points to note:
def pool_size(self):
    """Return number of workers."""
    # ACCESSES POOL INTERNAL STATE
    return self.pool._processes
This line is giving me a traceback:
Traceback (most recent call last):
File "/home/matt/cylc.git/lib/cylc/run.py", line 76, in main
server.configure()
File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 224, in configure
self.configure_suite()
File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 713, in configure_suite
self.proc_pool = mp_pool( self.config.cfg['cylc']['shell command execution'])
File "/home/matt/cylc.git/lib/cylc/mp_pool.py", line 98, in __init__
self.type, self.pool_size())
File "/home/matt/cylc.git/lib/cylc/mp_pool.py", line 155, in pool_size
return self.pool._processes
AttributeError: 'Pool' object has no attribute '_processes'
Damn, it works for me at Python 2.7.5 ... the perils of accessing the module's internals, I guess. Are you at 2.6? I'll try later at 2.6 on another server...
That part is pretty non-essential - can you just have pool_size() return some arbitrary number for the moment, and carry on? The only essential(ish) use of multiprocessing.Pool internals is in is_dead(), used in scheduler.py.
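For reference, a liveness check of this kind could look like the following sketch. It is not the branch's actual code; the `_pool` attribute (CPython's internal list of worker `Process` objects) is not a public API, which is exactly the kind of fragility being discussed here.

```python
import multiprocessing

def is_dead(pool):
    """Return True once every pool worker process has exited.

    NOTE: relies on pool._pool, an internal CPython attribute holding
    the worker Process objects; like the _processes access above, this
    may break across Python versions.
    """
    return all(not worker.is_alive() for worker in pool._pool)
```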
I am on 2.6.6.
I'll change pool_size() to return the input pool size, or to call multiprocessing.cpu_count() in the case of the default None for a process pool.
The main reason I needed a bit of access to the guts of the
Would
(Or worse. Not do enough?)
No,
New commit gets number of workers without accessing Pool internals.
Another minor change on this branch: I've disabled the printing of all job submission commands to stdout, except with
def __init__(self, pool_config):
    self.type = pool_config['pool type']
    pool_cls = multiprocessing.Pool
I think you have lost an if ...: statement before this.
Damn, I don't know how I did that ... another instance of #1022 (comment)
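The missing selection presumably looked something like this sketch (the 'pool type'/'pool size' config keys are taken from the snippets in this thread; the 'thread' value and function shape are assumptions):

```python
import multiprocessing
from multiprocessing.pool import ThreadPool

def make_pool(pool_config):
    """Choose the worker pool class from the configured pool type."""
    if pool_config['pool type'] == 'thread':
        pool_cls = ThreadPool            # worker threads in the daemon process
    else:
        pool_cls = multiprocessing.Pool  # separate worker processes
    # processes=None gives multiprocessing's default of cpu_count() workers
    return pool_cls(processes=pool_config.get('pool size'))
```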
I am getting a consistent failure of
I've just run that test about 20 times, and got two failures with the same symptoms. I'll investigate tomorrow...
I've now run the test 75 times in a loop, and got no more failures.
I have finally got a full run of the cylc test battery without failure. I guess the Rose test battery requires metomi/rose#1323.
I'll do more tests tomorrow. I want to know how this change impacts performance. (Hopefully, it will be faster!) We can probably get rid of the
For default pool sizes (esp. for a single-core machine!) and low system load, I would expect this to submit jobs more slowly than the old system, which submitted batches of 10 jobs at once in parallel. But it should be more robust, and it should perform better under heavy load (esp. if multiple cores are available).
@@ -29,7 +29,7 @@ title = "test all event hooks"

 [[prep]]
     command scripting = """
-        printf "%-20s %-8s %s\n" EVENT TASK MESSAGE > $EVNTLOG
+        printf "%-20s %-8s %s\n" EVENT TASK MESSAGE > {{ EVNTLOG }}
This test was relying on the daemon environment being available to background jobs on the suite host - which is bad form anyway, but not even possible when using a process pool.
Also removes extra blank lines before scheduler methods.
Some tests in
The
When the suite gets overwhelmed, I'd start getting messages like:
And
One thing I noticed is that it still has something like
Another thing I noticed is that after the suite has received a shutdown command, it continues to print messages about jobs being submitted (although the workers appear to correctly reject them). This may confuse users. Overall, however, I think the new logic is much more efficient (and even more so if we remove the
The rundb DAO starts a thread, and
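The worker-side rejection after shutdown, mentioned above, could be implemented roughly as follows. This is a thread-pool sketch with an illustrative shared flag, not the branch's actual code; with a process pool the flag would need to be a multiprocessing.Event shared with the workers rather than a plain threading.Event.

```python
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical flag, set by the scheduler when shutdown is requested.
stop_event = threading.Event()

def run_command(cmd):
    """Reject queued jobs once shutdown has been requested."""
    if stop_event.is_set():
        return (cmd, 'rejected: shutting down')
    # ... real command execution would happen here ...
    return (cmd, 'submitted')
```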
if reason:
    msg += ' (' + reason + ')'
print msg
if getattr(self, "log", None) is not None:
    self.log.info(msg)
Oops, accidental deletion.
Fixed - #1012 (diff)
Do you think we can remove the
The
Ah, yes - I just revisited #584 to remind myself. Your test programs suggested that (a) we should not need the lsof logic with a process pool; and (b) with a thread pool, moving the file-write into the thread that executes the script, as per my suggestion above, will not help (because your threaded test program writes and executes inside the threads)... (although it could still speed up the main thread or process a bit). So, is there any good reason to retain the current choice of thread or process pool on this branch, or shall I just delete the thread pool capability (and the lsof code with it)?
Yes, it is a good idea to simplify things.
I've changed the 'submitting now' message (which has not been accurate since before batched job submission, really) to 'incrementing submit number', which fits with its use in
Test battery passes. I had to change several test suites after bash got upgraded (to a buggy version!) on my box; and the UTC-mode clock-trigger test has never worked on non-UTC system clocks, until now.
@matthewrmshin - in my opinion we are good to go with this now, but do you want to do more testing?
Job submissions sure go off fast without the lsof logic in
(I am now doing some final testing.)
(Restart tests 04 and 09 are still unstable, but I don't think the instability has anything to do with this change.)
Process pool to replace command execution threads.
@@ -117,7 +118,7 @@
 force_restart|2013092300|1|1|succeeded
 force_restart|2013092306|1|1|succeeded
 force_restart|2013092312|0|1|held
 output_states|2013092300|1|1|succeeded
-output_states|2013092306|1|1|running
+output_states|2013092306|2|1|running
I'm trying to incorporate this into another branch - but I don't understand this change.
It does look odd - I'll attempt to understand it later today...
Broken since cylc/cylc-flow#1012.
Replaces #891 (which, pre cylc-6, targeted the wrong branch); closes #596
A process pool to handle execution of job submission, event handler, and poll and kill commands.
Replaces batched command execution in threads.