draft of multithreading blog post #408
Conversation
From the very beginning --- prior even to the 0.1 release --- Julia has had the `Task`
type, providing symmetric coroutines and event-based I/O.
So we have always had a unit of *concurrency* in the language; it just wasn't *parallel*
(simultaneous streams of execution) yet.
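As a quick illustration of that pre-existing model, here is a minimal sketch (not from the draft) of task concurrency using the long-stable `@async` form:

```julia
# Two tasks interleave on a single thread: concurrency without parallelism.
t = @async (sleep(1); "done")
println("still responsive while t sleeps")
fetch(t)  # blocks until the task finishes, then yields "done"
```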
Does this change to "real parallelism" here have a practical effect in some computation-based IO stuff, like managing a GPU and then doing CPU operations? Or managing multiple GPUs?
```
$ JULIA_NUM_THREADS=4 ./julia
```
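A quick way to confirm the setting took effect once the REPL is up (using the standard `Threads` API):

```julia
julia> Threads.nthreads()
4
```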
Note that IDEs like Juno automatically detect the number of cores in a user's processor, giving Julia multithreading out of the box when used in these systems.
(I think it is important to mention this for less technical users)
Even if you automatically detect cores, you still need to be able to change the number: to give some cores to BLAS, the OS, or something else. So, ideally, Juno etc. would present the cores available, but have a way for you to change it in the IDE.
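For reference, the BLAS thread count is already adjustable at runtime through the existing `LinearAlgebra.BLAS` API, so an IDE could expose both knobs (the value 2 here is arbitrary):

```julia
using LinearAlgebra
BLAS.set_num_threads(2)  # leave the remaining cores for Julia's own threads
```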
Would be cool to drop a screenshot (or that screenshot) here
Great work! I just listed the questions I had when reading it. Hopefully that helps.
As we often do, we tried to pick a method that would maximize throughput
and reliability.
We have a shared pool of stacks allocated by `mmap` (`VirtualAlloc` on
Windows), defaulting to 4MiB each (2MiB on 32-bit systems).
Is this changeable?
Yes
## Acknowledgements
We would like to gratefully acknowledge funding support from Intel and relational.ai
Suggested change:
```diff
-We would like to gratefully acknowledge funding support from Intel and relational.ai
+We would like to gratefully acknowledge funding support from Intel and relationalAI
```
Are they ok with capitalizing it as "RelationalAI"?
Today we are happy to announce a major new chapter in that story.
We are releasing an entirely new threading interface for Julia programs:
fully general task parallelism, inspired by parallel programming systems
like [Cilk][] and [Go][].
We should note in which version of Julia (commit or nightly) the new interface is available. Otherwise people may expect it in the latest release version.
That's easy --- it's not available :)
I think of it as somewhat analogous to garbage collection: with GC, you
can freely allocate objects without worrying about how it works or when and how they
are freed.
With task parallelism, you freely spawn tasks without worrying about where they run.
It would be cool if we could say that a very large number of tasks (is it tens of thousands, or millions?) can be spawned without worry. Some users coming from HPC may have a pthreads view of the world, and this can help make it clear.
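For instance, something like the following should be unremarkable under the new scheduler (a sketch using the `Threads.@spawn` spelling the interface eventually shipped under; the draft calls it `@par`):

```julia
using Base.Threads

# Spawn 100,000 lightweight tasks; they are multiplexed onto the pool of
# worker threads rather than each becoming its own OS thread, pthreads-style.
tasks = [Threads.@spawn sum(sin, 1:1_000) for _ in 1:100_000]
total = sum(fetch, tasks)
```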
```
    return fib(n - 1) + fetch(t)
end
```
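For context, the complete example this hunk comes from reads roughly as follows (a reconstruction, again written with `Threads.@spawn` in place of the draft's `@par`):

```julia
import Base.Threads: @spawn

function fib(n::Int)
    n <= 1 && return n
    t = @spawn fib(n - 2)        # runs concurrently with the fib(n - 1) below
    return fib(n - 1) + fetch(t)
end
```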
Great blog post so far. This is such exciting stuff. Here are some comments:

- Might be worth splitting into two blog posts.
- There's more impact in taking a sequential code and showing that the diff required to make it parallel is tiny. The more direct comparison should also show more scaling. Maybe modify the Base mergesort to be parallel instead? Or compare a simple sequential merge sort with a parallel one? Can additionally compare with the optimized built-in sort and show that it's a bit faster, but only a bit, and the parallel one is still faster.
- A little coda to the section where you pass temps through the `psort!` implementation would be good, just summarizing what was done, remarking on how simple it was, and maybe showing the improved performance.
- The section on "integers are, fortunately, free" is a bit confusing: why is allocating that much virtual memory unconcerning? A lot of people won't understand this.
- "In practice, we have an alternate implementation of stack switching": this doesn't indicate when, if ever, it is used. Maybe add a sentence about how to switch this (compile flag) and that we'll continue to explore the design space for task stacks to get the best of all worlds as much as possible.
- "This is a tricky synchronization problem, since some threads might be scheduling new work while other threads are deciding to block." Is this meant to be "deciding to sleep"?
- "My hands-down favorite": there are multiple people on the byline, so using first person singular here is confusing.

Should I just switch to "we" everywhere?

Given the multiple authors I think that's the way to go.
This, of course, is the classic highly-inefficient tree recursive implementation of
the Fibonacci sequence --- but running on any number of processor cores!
Suggested change:
```diff
-the Fibonacci sequence --- but running on any number of processor cores!
+the Fibonacci sequence--but running on any number of processor cores!
```

I think markdown will interpret this sequence (or use `&mdash;` directly).
Software performance depends more and more on exploiting multiple processor cores.
The [free lunch][] is still over.
Well, we here in the Julia developer community have something of a reputation for
caring about performance, so we've known for years that we would need a good
Suggested change:
```diff
-caring about performance, so we've known for years that we would need a good
+caring about easy performance. We've already built a strong story around multi-process, distributed programming and GPUs. But we've also known that we needed fast and composable multi-threading.
```

EDIT: compostable -> composable
The [free lunch][] is still over.
Well, we here in the Julia developer community have something of a reputation for
caring about performance, so we've known for years that we would need a good
story for multi-threaded, multi-core execution.
Suggested change (deletion, continuing the rewrite above):
```diff
-story for multi-threaded, multi-core execution.
```
```
i = 6 on thread 2
```

Without further ado, let's try some nested parallelism.
Suggested change:
```diff
-Without further ado, let's try some nested parallelism.
+A big differentiator of our Julia Task-based parallelism system is the automatic handling of nested parallelism. Each Task can act like a first-class future, just running simultaneously to utilize all CPU cores efficiently. So without further ado, let's try some nested parallelism.
```
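A tiny illustration of what "nested" means here (a self-contained sketch, with `Threads.@spawn` standing in for the draft's `@par`): tasks spawned inside other tasks all land in the same scheduler.

```julia
using Base.Threads

function outer()
    # nested spawns: these tasks are created from within a spawned task
    inner = [Threads.@spawn sum(rand(1000)) for _ in 1:4]
    return sum(fetch, inner)
end

outers = [Threads.@spawn outer() for _ in 1:4]  # spawns that themselves spawn
println(sum(fetch, outers))
```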
for parallelism.
Here is the code:
Do we set Julia as the default syntax highlighter, or do we need to annotate the code blocks?
```julia
half = @par psort!(v, lo, mid) # task to sort the lower half; will run
psort!(v, mid+1, hi)           # in parallel with the current call sorting
                               # the upper half
wait(half)                     # wait for the lower half to finish
```
I think it'd be fun to use `fetch` here (implementing `sort` instead of `sort!`), as that seems harder to me (relative to what other languages provide and do), and we're already making a copy below.
```julia
julia> function psort(v, lo::Int=1, hi::Int=length(v))
           if lo > hi                      # 1 or 0 elements; nothing to do
               return similar(v, 0)
           elseif lo == hi
               out = similar(v, 1)
               out[1] = v[lo]
               return out
           end
           if hi - lo < 100000             # below some cutoff, run in serial
               return sort(view(v, lo:hi), alg = MergeSort)
           end
           mid = (lo+hi)>>>1               # find the midpoint
           half = @task psort(v, lo, mid)  # task to sort the lower half; will run
           half.sticky = false
           schedule(half)
           right = psort(v, mid+1, hi)     # in parallel with the current call sorting
                                           # the upper half
           left = fetch(half)              # wait for the lower half to finish
           out = similar(v, hi-lo+1)       # result
           @assert length(right) + length(left) == length(out)
           i, il, ir = 1, 1, 1             # merge the two sorted sub-arrays
           @inbounds while il <= length(left) && ir <= length(right)
               l, r = left[il], right[ir]
               if l < r
                   out[i] = l
                   il += 1
               else
                   out[i] = r
                   ir += 1
               end
               i += 1
           end
           @inbounds while il <= length(left)
               out[i] = left[il]
               il += 1
               i += 1
           end
           @inbounds while ir <= length(right)
               out[i] = right[ir]
               ir += 1
               i += 1
           end
           return out
       end

julia> using Random; Random.seed!(0); a = rand(20000000);

julia> @time sort(a);
  1.469319 seconds (6 allocations: 152.588 MiB)

julia> @time sort(a);
  1.540864 seconds (6 allocations: 152.588 MiB, 2.91% gc time)

julia> @time psort(a);  # 1 thread
 20.526943 seconds (879.42 M allocations: 14.520 GiB, 9.55% gc time)

julia> @time psort(a);  # 2 threads
 12.870170 seconds (879.42 M allocations: 14.520 GiB, 15.73% gc time)

julia> @time psort(a);
 10.782067 seconds (879.42 M allocations: 14.520 GiB, 18.62% gc time)

julia> @time psort(a);  # 4 threads
  9.499449 seconds (879.42 M allocations: 14.520 GiB, 22.32% gc time)
```
[UnicodePlots line chart from the same session: psort time in seconds (y axis, 0 to 30) versus number of threads (x axis, 1 to 4), with the time falling from roughly 20 s at 1 thread toward 9 s at 4 threads.]
Exciting times ahead! Congrats everyone.
Let's try a different machine with more CPU cores:

```
$ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done
```
Briefly summarize what `psort.jl` does. E.g., does this include compile time? I'm noting the times are longer than above.
```
1.222777 seconds (3.78 k allocations: 686.935 MiB, 9.14% gc time)
0.958517 seconds (3.79 k allocations: 686.935 MiB, 18.21% gc time)
0.836891 seconds (3.78 k allocations: 686.935 MiB, 21.10% gc time)
```
can we graph this and/or show the normalized values?
```
lock(cond::Threads.Condition)
while !ready
```
Suggested change:
```diff
-lock(cond::Threads.Condition)
-while !ready
+lock(cond::Threads.Condition)
+try
+    while !ready
+        wait(cond)
+    end
+finally
+    unlock(cond)
+end
```
As in previous versions, the standard lock to use to protect critical sections
is `ReentrantLock`, which is now thread-safe (it was previously only used for
synchronizing tasks).
`Threads.SpinLock` is also available, to be used in rare circumstances where
Suggested change:
```diff
-`Threads.SpinLock` is also available, to be used in rare circumstances where
+There are some other types of locks defined internally for specific circumstances which usually should not be applicable to the typical user (these include `Threads.SpinLock`, `Threads.Mutex`, and a variety of libuv-based mutexes protecting various parts of the runtime). These are used in rare circumstances where (1) only threads and not tasks will be synchronized, and (2) you know that the lock will only be held for a short time.
```
`Threads.SpinLock` is also available, to be used in rare circumstances where
(1) only threads and not tasks need to be synchronized, and (2) you expect to
hold the lock for a short time.
`Semaphore` and `Event` are also available, completing the standard set of
not really complete? there'd also be barrier, rwlocks, and a "once" I think in a "standard set"

Suggested change:
```diff
-`Semaphore` and `Event` are also available, completing the standard set of
+The `Threads` module also provides `Semaphore` and `Event` types with their standard definition.
```
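For illustration, here is a short sketch of the semaphore primitive in use (`Base.Semaphore` with `Base.acquire`/`Base.release` is existing standard-library API; the limit of 2 and the `do_work` function are made up):

```julia
sem = Base.Semaphore(2)      # allow at most 2 tasks in the region at once

do_work(i) = sum(rand(10_000)) + i   # hypothetical worker function

@sync for i in 1:8
    Threads.@spawn begin
        Base.acquire(sem)
        try
            do_work(i)
        finally
            Base.release(sem)    # always release, even if do_work throws
        end
    end
end
```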
is `ReentrantLock`, which is now thread-safe (it was previously only used for
synchronizing tasks).
`Threads.SpinLock` is also available, to be used in rare circumstances where
(1) only threads and not tasks need to be synchronized, and (2) you expect to
Suggested change (deletion, continuing the rewrite above):
```diff
-(1) only threads and not tasks need to be synchronized, and (2) you expect to
```
(1) only threads and not tasks need to be synchronized, and (2) you expect to
hold the lock for a short time.
`Semaphore` and `Event` are also available, completing the standard set of
synchronization primitives.
Suggested change (deletion, continuing the rewrite above):
```diff
-synchronization primitives.
```
argument value to allocate space automatically when the caller doesn't provide it:

```
function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
```
Suggested change (whitespace):
```diff
-function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
+function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, cld(length(v), 2)) for i = 1:Threads.nthreads()])
```

Or:
```diff
+function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, (length(v) + 1) ÷ 2) for i = 1:Threads.nthreads()])
```

Or:
```diff
+function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, (hi - lo + 1) ÷ 2) for i = 1:Threads.nthreads()])
```

Or:
```diff
+function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, 0) for i = 1:Threads.nthreads()])  # and add from Base: `(length(t) < m-lo+1) && resize!(t, m-lo+1)`
```
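For context, inside the function body each task would then pick up the buffer belonging to its current thread, roughly like this (a sketch of the recipe; `m` is the midpoint, as in the draft's code):

```julia
temp = temps[Threads.threadid()]           # this thread's pre-allocated buffer
length(temp) < m - lo + 1 && resize!(temp, m - lo + 1)
copyto!(temp, 1, v, lo, m - lo + 1)        # stash the lower half before merging
```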
But for high-performance code we recommend thread-local state.
Our `psort!` routine above can be improved in this way.
Here is a recipe.
First, we modify the function to accept pre-allocated buffers, using a default
Suggested change:
```diff
-First, we modify the function to accept pre-allocated buffers, using a default
+First, we modify the function signature to accept pre-allocated buffers, using a default
```
Definitely faster, but we do seem to have some work to do on the
scalability of the runtime system.

### Seeding the default random number generator
Move this after `### IO`?
I put it here since I consider this something you might need to know to update code, while the IO section is more internal details.
Ah, that sounds like a good reason.
Suggested change:
```diff
-### Seeding the default random number generator
+### Random number generation
+
+The approach we've taken with Julia's default global random number generator (`rand()` and friends) is to make it thread-specific. On first use, each thread will create an independent instance of the default RNG type (currently `MersenneTwister`) seeded from system entropy. All operations that affect the random number state (`rand`, `seed!`, `randn`, etc.) then operate only on the current thread's RNG state. This way, multiple independent code sequences that seed and then use random numbers will individually work as expected. If you need all threads to use a known initial seed, you will need to set it explicitly on each worker thread at the start of the algorithm's work.
+
+For more precise control, better performance, or other elaborate requirements, we recommend allocating and passing your own RNG objects (e.g. `Random.MersenneTwister()`).
```
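A quick sketch of the explicit per-thread seeding mentioned above (standard `Random` and `Threads` APIs; the base seed value is arbitrary, and this relies on `@threads` placing one iteration on each thread):

```julia
using Random

# One iteration lands on each worker thread, seeding that thread's RNG.
Threads.@threads for i = 1:Threads.nthreads()
    Random.seed!(1000 + Threads.threadid())
end
```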
Here are some of the points we hope to focus on to further develop
our threading capabilities:

* Performance work on task switch and I/O latency.
Suggested change:
```diff
-* Performance work on task switch and I/O latency.
+We would like to gratefully acknowledge funding support from [Intel][] and [relationalAI][]
```
Did you mean to put this in this section and delete this content?
## Acknowledgements
Suggested change:
```diff
+[here]: https://github.com/JuliaLang/julia/pull/31086
+[Intel]: https://www.intel.com/
+[relationalAI]: http://relational.ai/
```
An "official" version will appear in a later release, to give us time to settle | ||
on an API we can commit to for the long term. | ||
Here's what you need to know if you want to upgrade your code over this period. | ||
|
Suggested change:
```diff
+- Managing [Task scheduling and synchronization](#Task-scheduling-and-synchronization)
+- Managing [Thread-local state](#Thread-local-state)
+- Effect on [Random number generation](#Random-Number-Generation)
```
Using it is not recommended, since it is hard to predict how much stack
space will be needed, for instance by the compiler or called libraries.

A thread can switch to running a given task simply (in principle) by switching
that's not really true on any platform

Suggested change:
```diff
-A thread can switch to running a given task simply (in principle) by switching
+A thread can switch to running a given task by adjusting the registers to appear to "return from" the previous task switch. We allocate a new stack out of a local pool just before we start running it.
```
We also have an alternate implementation of stack switching (controlled by the
`ALWAYS_COPY_STACKS` variable in `options.h`) that trades time for memory by
copying live stack data when a task switch occurs.
Suggested change:
```diff
 copying live stack data when a task switch occurs.
+This may not be compatible with foreign code that uses `cfunction`,
+so it is not the default.
```
scheduler.
In particular, we need to make sure no other thread sees that task and thinks
"oh, there's a task I can run", causing it to scribble on the scheduler's
stack.
Suggested change (deletion):
```diff
-stack.
```
* More performant parallel loops and reductions, with more scheduling options.
* Allow adding more threads at run time.
* Improved debugging tools.
* Explore API extensions, e.g. cancel points.
I hope never :P
* Allow adding more threads at run time.
* Improved debugging tools.
* Explore API extensions, e.g. cancel points.
* Thread-safe data structures.
Suggested change:
```diff
-* Thread-safe data structures.
+* A standard library of thread-safe data structures for user code.
```
We are also grateful to the several people who patiently tried this functionality
while it was in development and filed bug reports or pull requests, and spurred us
to keep going!
Suggested change:
```diff
-to keep going!
+to keep going! We know there are remaining problems, so we will appreciate you letting us know about your experience with it through the GitHub and Discourse channels!
```
@JeffBezanson: Awesome work! Regarding the blog post, I missed the origin of the threading effort, which was this PR: JuliaLang/julia#6741. I don't want to overrate my work, but the prototype was pretty functional and I had the impression that it was an important step towards serious multi-threading. There is also a publication outlining the concepts behind that effort: https://ieeexplore.ieee.org/document/7069898
Drafty and a bit incomplete in places, but I felt I had enough that we should start the editing and feedback process. I'm particularly interested to know whether this leaves you with any major unanswered questions about what's going on.