Improvement of parallel_for implementation #16
Merged
This PR:

- makes `chunk_size` in `parallel_for` a parameter that takes a sensible default (`N/(8*num_domains)`)
- reimplements `parallel_for` with a divide-and-conquer work distribution, keeping the previous sequential distribution as `parallel_for_sequential`
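For reference, a minimal usage sketch of the changed API. This assumes the `Domainslib.Task` names from a current domainslib release (`setup_pool`, `run`, `teardown_pool`); exact labels may differ from the code as it stood at the time of this PR.

```ocaml
let () =
  let pool = Domainslib.Task.setup_pool ~num_domains:3 () in
  Domainslib.Task.run pool (fun () ->
    (* Omitting chunk_size picks the new default of N/(8*num_domains)... *)
    Domainslib.Task.parallel_for pool ~start:0 ~finish:999
      ~body:(fun i -> ignore (i * i));
    (* ...or it can be overridden explicitly where the caller knows better. *)
    Domainslib.Task.parallel_for pool ~chunk_size:32 ~start:0 ~finish:999
      ~body:(fun i -> ignore (i * i)));
  Domainslib.Task.teardown_pool pool
```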
The motivation is two-fold:

- improving how work is distributed over the pool
- providing a sensible default `chunk_size`, rather than forcing the user to pick a `chunk_size`, which can often be poor even for seemingly sensible values

My design notes for this change are:
## Considerations for `parallel_for`

There are a couple of subtle things that can happen with `parallel_for`, concerning how work is distributed and how a `chunk_size` is chosen.

### Work distribution
Consider the following work distribution methods (a sketch of the DC scheme follows below):

- sequential: the caller thread queues one work item per chunk, in order
- divide-and-conquer (DC): the range is split in two, and each half is queued as a work item that splits itself further, until pieces reach `chunk_size`

In the sequential scheme, all the overhead of work queuing, signalling waiting workers and synchronizing on results is taken by the single caller thread. In the DC scheme, while it may generate some extra work items (O(log(n))), the overhead of work queuing, signalling waiting workers and synchronizing on results is distributed over the workers in the pool.
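To make the DC scheme concrete, here is a minimal sketch built on `Domainslib.Task.async`/`Task.await`. It is illustrative rather than the exact implementation in this PR; the name `dc_for` and the halving details are mine.

```ocaml
open Domainslib

(* Divide-and-conquer: recursively split [s, e], queuing one half as a
   task another worker can pick up while this worker recurses into the
   other half. Splitting stops once a piece fits within chunk_size. *)
let rec dc_for pool ~chunk_size ~body s e =
  if e - s + 1 <= chunk_size then
    for i = s to e do body i done
  else begin
    let mid = s + (e - s) / 2 in
    let left = Task.async pool (fun () -> dc_for pool ~chunk_size ~body s mid) in
    dc_for pool ~chunk_size ~body (mid + 1) e;
    Task.await pool left
  end
```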
Benchmarking in Sandmark suggests the DC is noticeably better (often by a significant margin at 16 or more domains).
### How to choose `chunk_size`
At first glance, it looks like for balanced workloads you should choose:

`n_tasks / n_workers`

There are two problems with this. Firstly, experience suggests it is rare for a workload to be truly balanced. Even a basic piece of code with no garbage collection can experience different run times (e.g. from different paths through the code, caching or NUMA effects), while in the presence of garbage collection (either a minor collection or a major slice) different domains can see different execution times for seemingly identical code.
There is a second issue relating to rounding. Consider sequential distribution in the presence of rounding, with a `chunk_size` set to `int(n_tasks / n_workers)`. We can write:

`n_tasks = k * chunk_size + r`

where

`k = n_workers`
`r = n_tasks % n_workers`

There will then be the following chunks to execute:

- `k` chunks of `chunk_size`
- `r / chunk_size` chunks of `chunk_size`
- `1` chunk of `r % chunk_size`
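To see the rounding effect concretely, here is a small worked example; the numbers are mine and purely illustrative.

```ocaml
(* With n_tasks = 100 and n_workers = 16, chunk_size = 100 / 16 = 6, so
   the pool executes 16 chunks of 6 tasks and then 1 chunk of 4 tasks:
   once the first 16 chunks finish, a single worker runs the last chunk
   while the other 15 sit idle. *)
let () =
  let n_tasks = 100 and n_workers = 16 in
  let chunk_size = n_tasks / n_workers in   (* 6 *)
  let k = n_workers in                      (* 16 full chunks *)
  let r = n_tasks mod n_workers in          (* 4 leftover tasks *)
  Printf.printf "%d chunks of %d, %d chunks of %d, 1 chunk of %d\n"
    k chunk_size (r / chunk_size) chunk_size (r mod chunk_size)
(* prints: 16 chunks of 6, 0 chunks of 6, 1 chunk of 4 *)
```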
This means that you can end up with an under-utilized pool once the first `k` chunks are finished. Indeed, the critical path can often be one "time to do `chunk_size` on one worker" longer than the user expected.

It is difficult for users to select a default `chunk_size` that will work well with all permutations of `n_tasks`, `n_workers`, task variabilities, machine specifics, etc. With this in mind, I would like to advocate for a default `chunk_size` of:

`n_tasks / (8 * n_workers)`
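A minimal sketch of that default; the `max 1` clamp is my assumption, to keep the chunk size positive when `n_tasks < 8 * n_workers`.

```ocaml
(* Default chunk size: n_tasks / (8 * n_workers), clamped to at least 1.
   The clamp is an assumption; the PR may handle small inputs differently. *)
let default_chunk_size ~n_tasks ~n_workers =
  max 1 (n_tasks / (8 * n_workers))
```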
This attempts to exploit any imbalance and reduce the impact of rounding effects. We can't push the multiplier arbitrarily higher than `8`, as there is an overhead to managing the tasks. (This choice of `8` is also in line with other `parallel_for` implementations out there, as far as I can tell.)

## Benchmarks
In these sandmark benchmarks, we have the following variants:

- `dc_parallel_for`: domainslib is using this PR
- `dflt_chunksz`: I have changed sandmark to use a `chunk_size` of `n_tasks / (8 * n_workers)` where previously it had `n_tasks / n_workers`

The end result of using the default `chunk_size` and this PR's implementation of `parallel_for` gives you the blue line. For some benchmarks, the impact can be substantial. Potentially more importantly, with this change it is less likely for a user to oversize their pool and see a reduction in performance.