
draft of multithreading blog post #408

Merged (22 commits, Jul 23, 2019)
Conversation

@JeffBezanson (Member) commented:

Drafty and a bit incomplete in places, but I felt I had enough that we should start the editing and feedback process. I'm particularly interested to know whether this leaves you with any major unanswered questions about what's going on.

From the very beginning --- prior even to the 0.1 release --- Julia has had the `Task`
type, providing symmetric coroutines and event-based I/O.
So we have always had a unit of *concurrency* in the language, it just wasn't *parallel*
(simultaneous streams of execution) yet.
@ChrisRackauckas (Member) commented Jul 14, 2019:

Does this change to "real parallelism" here have a practical effect in some computation-based IO stuff, like managing a GPU and then doing CPU operations? Or managing multiple GPUs?


```
$ JULIA_NUM_THREADS=4 ./julia
```
@ChrisRackauckas (Member) commented Jul 14, 2019:

Note that IDEs like Juno automatically detect the number of cores in a user's processor, giving Julia multithreading out of the box when used in these systems.

(I think it is important to mention this for less technical users)

Member commented:

Even if you automatically detect cores, you still need to be able to change the number - to give some cores to BLAS, or the OS or something else. So, ideally, Juno etc. would present the cores available, but have a way for you to change it in the IDE.

Member commented:

It does.

[screenshot attachment: "Capture"]

Member commented:

Would be cool to drop a screenshot (or that screenshot) here

@ChrisRackauckas (Member) left a comment:

Great work! I just listed the questions I had when reading it. Hopefully that helps.

As we often do, we tried to pick a method that would maximize throughput
and reliability.
We have a shared pool of stacks allocated by `mmap` (`VirtualAlloc` on
windows), defaulting to 4MiB each (2MiB on 32-bit systems).
Member commented:

Is this changeable?

Member commented:

Yes


## Acknowledgements

We would like to gratefully acknowledge funding support from Intel and relational.ai
Member commented:

Suggested change
We would like to gratefully acknowledge funding support from Intel and relational.ai
We would like to gratefully acknowledge funding support from Intel and relationalAI

Member commented:

Are they ok with capitalizing it as "RelationalAI"?

Today we are happy to announce a major new chapter in that story.
We are releasing an entirely new threading interface for Julia programs:
fully general task parallelism, inspired by parallel programming systems
like [Cilk][] and [Go][].
Member commented:

We should note in which version of Julia (commit or nightly) the new interface is available. Otherwise people may expect it in the latest release version.

Member (Author) commented:

That's easy --- it's not available :)

I think of it as somewhat analogous to garbage collection: with GC, you
can freely allocate objects without worrying about how it works or when and how they
are freed.
With task parallelism, you freely spawn tasks without worrying about where they run.
Member commented:

It would be cool if we could say that a very large number of tasks (is it tens of thousands, or millions?) can be spawned without worry. Some users coming from HPC may have a pthreads view of the world, and this can help make it clear.
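To make the point concrete, here is a small illustrative sketch (the function name `spawn_many` is ours, not from the post) showing that tasks are cheap enough to spawn by the tens of thousands, independent of how many OS threads are available:

```
using Base.Threads: @spawn

# spawn one task per item; tasks are multiplexed onto however many
# threads are available, so this works even with JULIA_NUM_THREADS=1
function spawn_many(n)
    tasks = [@spawn(i * i) for i in 1:n]
    return sum(fetch.(tasks))
end
```

Spawning 10,000 tasks this way completes in well under a second on a laptop, which is the contrast with a pthreads-style one-OS-thread-per-unit-of-work model.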

return fib(n - 1) + fetch(t)
end
```
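For readers skimming the thread: the snippet above is the tail of the post's parallel Fibonacci example. A complete, runnable sketch (assuming the `Threads.@spawn` macro introduced in Julia 1.3) looks like:

```
import Base.Threads.@spawn

# naive tree-recursive Fibonacci; each fib(n-2) branch is spawned as a
# task that may run on another thread, while fib(n-1) runs inline
function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)          # runs concurrently with the next call
    return fib(n - 1) + fetch(t)   # combine when the spawned task finishes
end
```

The result is deterministic regardless of the thread count; only the scheduling varies.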

This comment was marked as resolved.

@StefanKarpinski (Member) commented Jul 15, 2019:

Great blog post so far. This is such exciting stuff. Here are some comments.


Might be worth splitting into two blog posts:

  1. high level overview, usage examples, scaling experiments
  2. details, internals, design decisions (RNG, I/O, etc.)

There’s more impact in taking a sequential code and showing that the diff required to make it parallel is tiny. A more direct comparison should also show more scaling. Maybe modify the Base mergesort to be parallel instead? Or compare a simple sequential merge sort with a parallel one? Can additionally compare with the optimized built-in sort and show that it’s a bit faster, but only a bit, and the parallel one is still faster.

A little coda to the section where you pass temps through the psort! implementation would be good, just summarizing what was done and remarking on how it was pretty simple and maybe showing the improved performance.

The section on rand() seems a bit out of place. Maybe have a section on design decisions that includes that and some of the I/O stuff?

“integers are, fortunately, free” is a bit confusing—why is allocating that much virtual memory unconcerning? A lot of people won’t understand this.

“In practice, we have an alternate implementation of stack switching”: doesn’t indicate when, if ever, this is used. Maybe add a sentence about how to switch this (compile flag) and that we’ll continue to explore the design space for task stacks to get the best of all worlds as much as possible.

“This is a tricky synchronization problem, since some threads might be scheduling new work while other threads are deciding to block.” Is this meant to be “deciding to sleep”?

“My hands-down favorite”—there are multiple people on the by line, so using first person singular here is confusing.

@JeffBezanson (Member, Author) replied:

“My hands-down favorite”—there are multiple people on the by line, so using first person singular here is confusing.

Should I just switch to "we" everywhere?

@StefanKarpinski (Member) replied:

Given the multiple authors I think that’s the way to go.

```

This, of course, is the classic highly-inefficient tree recursive implementation of
the Fibonacci sequence --- but running on any number of processor cores!
Member commented:

Suggested change
the Fibonacci sequence --- but running on any number of processor cores!
the Fibonacci sequence--but running on any number of processor cores!

I think markdown will interpret this sequence (or you can use the character directly)

Software performance depends more and more on exploiting multiple processor cores.
The [free lunch][] is still over.
Well, we here in the Julia developer community have something of a reputation for
caring about performance, so we've known for years that we would need a good
@vtjnash (Member) commented Jul 20, 2019:

Suggested change
caring about performance, so we've known for years that we would need a good
caring about easy performance. We've already built a strong story around multi-process, distributed programming and GPUs. But we've also known that we needed fast and composable multi-threading.

EDIT: compostable → composable

Member commented:

Although I sort of like the REUSE symbol for this:

The [free lunch][] is still over.
Well, we here in the Julia developer community have something of a reputation for
caring about performance, so we've known for years that we would need a good
story for multi-threaded, multi-core execution.
Member commented:

Suggested change
story for multi-threaded, multi-core execution.

i = 6 on thread 2
```

Without further ado, let's try some nested parallelism.
Member commented:

Suggested change
Without further ado, let's try some nested parallelism.
A big differentiator of Julia's Task-based parallelism system is its automatic handling of nested parallelism. Each Task can act like a first-class future, running simultaneously with other tasks to utilize all CPU cores efficiently. So without further ado, let's try some nested parallelism.

for parallelism.
Here is the code:

```
Member commented:

Do we set Julia as the default syntax highlighter, or do we need to annotate the code blocks

half = @par psort!(v, lo, mid) # task to sort the lower half; will run
psort!(v, mid+1, hi) # in parallel with the current call sorting
# the upper half
wait(half) # wait for the lower half to finish
Member commented:

I think it'd be fun to use fetch here (implementing sort instead of sort!), as that seems harder to me (relative to what other languages provide and do), and we're already making a copy below

@vtjnash (Member) commented Jul 22, 2019:

```
julia> function psort(v, lo::Int=1, hi::Int=length(v))
           if lo > hi                       # 1 or 0 elements; nothing to do
               return similar(v, 0)
           elseif lo == hi
               out = similar(v, 1)
               out[1] = v[lo]
               return out
           end
           if hi - lo < 100000              # below some cutoff, run in serial
               return sort(view(v, lo:hi), alg = MergeSort)
           end

           mid = (lo+hi)>>>1                # find the midpoint

           half = @task psort(v, lo, mid)   # task to sort the lower half; will run
           half.sticky = false
           schedule(half)
           right = psort(v, mid+1, hi)      # in parallel with the current call sorting
                                            # the upper half
           left = fetch(half)               # wait for the lower half to finish
           out = similar(v, hi-lo+1)        # result
           @assert length(right) + length(left) == length(out)

           i, il, ir = 1, 1, 1              # merge the two sorted sub-arrays
           @inbounds while il <= length(left) && ir <= length(right)
               l, r = left[il], right[ir]
               if l < r
                   out[i] = l
                   il += 1
               else
                   out[i] = r
                   ir += 1
               end
               i += 1
           end
           @inbounds while il <= length(left)
               out[i] = left[il]
               il += 1
               i += 1
           end
           @inbounds while ir <= length(right)
               out[i] = right[ir]
               ir += 1
               i += 1
           end
           return out
       end

julia> using Random; Random.seed!(0); a = rand(20000000);

julia> @time sort(a);
  1.469319 seconds (6 allocations: 152.588 MiB)

julia> @time sort(a);
  1.540864 seconds (6 allocations: 152.588 MiB, 2.91% gc time)

julia> @time psort(a); # 1 thread
 20.526943 seconds (879.42 M allocations: 14.520 GiB, 9.55% gc time)

julia> @time psort(a); # 2 threads
 12.870170 seconds (879.42 M allocations: 14.520 GiB, 15.73% gc time)

julia> @time psort(a);
 10.782067 seconds (879.42 M allocations: 14.520 GiB, 18.62% gc time)

julia> @time psort(a); # 4 threads
  9.499449 seconds (879.42 M allocations: 14.520 GiB, 22.32% gc time)
```
[UnicodePlots chart: psort time in seconds (y-axis, 0–30) versus number of threads (x-axis, 1–4), decreasing from about 20 s at 1 thread toward about 9.5 s at 4 threads]

JeffBezanson and others added 8 commits July 20, 2019 12:52
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>
Co-Authored-By: Stefan Karpinski <stefan@karpinski.org>
JeffBezanson and others added 2 commits July 20, 2019 15:46
Co-Authored-By: Kristoffer Carlsson <kcarlsson89@gmail.com>
@timholy (Member) left a comment:

Exciting times ahead! Congrats everyone.

Let's try a different machine with more CPU cores:

```
$ for n in 1 2 4 8 16; do JULIA_NUM_THREADS=$n ./julia psort.jl; done
Member commented:

Briefly summarize what psort.jl does. E.g., does this include compile time? I'm noting the times are longer than above.

Co-Authored-By: Tim Holy <tim.holy@gmail.com>
1.222777 seconds (3.78 k allocations: 686.935 MiB, 9.14% gc time)
0.958517 seconds (3.79 k allocations: 686.935 MiB, 18.21% gc time)
0.836891 seconds (3.78 k allocations: 686.935 MiB, 21.10% gc time)
```
Member commented:

can we graph this and/or show the normalized values?

Co-Authored-By: Tim Holy <tim.holy@gmail.com>
Co-Authored-By: Jameson Nash <vtjnash@gmail.com>

```
lock(cond::Threads.Condition)
while !ready
Member commented:

Suggested change
while !ready
lock(cond::Threads.Condition)
try
while !ready
wait(cond)
end
finally
unlock(cond)
end

As in previous versions, the standard lock to use to protect critical sections
is `ReentrantLock`, which is now thread-safe (it was previously only used for
synchronizing tasks).
`Threads.SpinLock` is also available, to be used in rare circumstances where
Member commented:

Suggested change
`Threads.SpinLock` is also available, to be used in rare circumstances where
There are some other types of locks defined internally for specific circumstances which usually should not be applicable to the typical user (these include `Threads.SpinLock`, `Threads.Mutex`, and a variety of libuv-based mutexes protecting various parts of the runtime). These are used in rare circumstances where (1) only threads and not tasks will be synchronized, and (2) you know the lock will only be held for a short time.

`Threads.SpinLock` is also available, to be used in rare circumstances where
(1) only threads and not tasks need to be synchronized, and (2) you expect to
hold the lock for a short time.
`Semaphore` and `Event` are also available, completing the standard set of
Member commented:

not really complete? there'd also be barrier, rwlocks, and a "once" I think in a "standard set"

Suggested change
`Semaphore` and `Event` are also available, completing the standard set of
The `Threads` module also provides `Semaphore` and `Event` types with their standard definition.
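As a concrete illustration of the `ReentrantLock` usage discussed above (the function name `threaded_count` is ours, for demonstration only):

```
using Base.Threads

# increment a shared counter under a ReentrantLock; correct for any
# thread count, though an atomic would be faster for this simple case
function threaded_count(n)
    lk = ReentrantLock()
    total = Ref(0)
    @threads for i in 1:n
        lock(lk) do              # lock/unlock handled by the do-block form
            total[] += 1
        end
    end
    return total[]
end
```

Without the lock, concurrent `total[] += 1` updates could be lost when more than one thread is active.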

is `ReentrantLock`, which is now thread-safe (it was previously only used for
synchronizing tasks).
`Threads.SpinLock` is also available, to be used in rare circumstances where
(1) only threads and not tasks need to be synchronized, and (2) you expect to
Member commented:

Suggested change
(1) only threads and not tasks need to be synchronized, and (2) you expect to

(1) only threads and not tasks need to be synchronized, and (2) you expect to
hold the lock for a short time.
`Semaphore` and `Event` are also available, completing the standard set of
synchronization primitives.
Member commented:

Suggested change
synchronization primitives.

argument value to allocate space automatically when the caller doesn't provide it:

```
function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
Member commented:

Suggested change
function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, cld(length(v), 2)) for i = 1:Threads.nthreads()])
Suggested change
function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, (length(v) + 1) ÷ 2) for i = 1:Threads.nthreads()])
Suggested change
function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, (hi - lo + 1) ÷ 2) for i = 1:Threads.nthreads()])
Suggested change
function psort!(v, lo::Int=1, hi::Int=length(v), temps = [similar(v,cld(length(v),2)) for i = 1:Threads.nthreads()])
function psort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, 0) for i = 1:Threads.nthreads()]) # and add from Base `(length(t) < m-lo+1) && resize!(t, m-lo+1)`

But for high-performance code we recommend thread-local state.
Our `psort!` routine above can be improved in this way.
Here is a recipe.
First, we modify the function to accept pre-allocated buffers, using a default
Member commented:

Suggested change
First, we modify the function to accept pre-allocated buffers, using a default
First, we modify the function signature to accept pre-allocated buffers, using a default
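The thread-local-buffer recipe being discussed can be sketched in miniature like this (the names `sum_chunks` and `temps` are illustrative; only the pattern of `threadid()`-indexed pre-allocated buffers comes from the post):

```
using Base.Threads

# each thread reuses its own scratch buffer instead of allocating
# inside the hot loop; buffers are indexed by threadid()
function sum_chunks(chunks,
                    temps = [zeros(Float64, maximum(length, chunks)) for _ in 1:nthreads()])
    partial = zeros(Float64, length(chunks))
    @threads for k in eachindex(chunks)
        buf = temps[threadid()]           # this thread's pre-allocated space
        n = length(chunks[k])
        copyto!(buf, 1, chunks[k], 1, n)  # stand-in for real scratch work
        partial[k] = sum(view(buf, 1:n))
    end
    return sum(partial)
end
```

Passing `temps` as a defaulted argument, as the post does for `psort!`, lets callers supply their own buffers while keeping the single-argument call convenient.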

Definitely faster, but we do seem to have some work to do on the
scalability of the runtime system.

### Seeding the default random number generator
Member commented:

Move this after ### IO?

Member (Author) commented:

I put it here since I consider this something you might need to know to update code, while the IO section is more internal details.

Member commented:

Ah, that sounds like a good reason.

Suggested change
### Seeding the default random number generator
### Random number generation
The approach we've taken with Julia's default global random number generator (`rand()` and friends) is to make it thread-specific. On first use, each thread will create an independent instance of the default RNG type (currently MersenneTwister) seeded from current system entropy. All operations that affect the random number state (`rand`, `Random.seed!`, `randn`, etc.) then operate on only the current thread's RNG state. This way, multiple independent code sequences that seed and then use random numbers will individually work as expected. If you need all threads to use a known initial seed, you will need to set it explicitly on each worker thread being used at the start of the algorithm work.
For more precise control, better performance, or other elaborate requirements, we recommend allocating and passing your own RNG objects (e.g. `Random.MersenneTwister()`).
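A small sketch of the "pass your own RNG" recommendation (the seed value is arbitrary):

```
using Random

# two separately-seeded RNG objects produce identical streams,
# independent of which thread or task ends up running the code
rng1 = MersenneTwister(1234)
rng2 = MersenneTwister(1234)
draws1 = rand(rng1, 5)
draws2 = rand(rng2, 5)
```

Explicit RNG objects decouple reproducibility from scheduling, which is exactly the property the thread-specific default RNG is designed to approximate.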

Here are some of the points we hope to focus on to further develop
our threading capabilities:

* Performance work on task switch and I/O latency.
Member commented:

Suggested change
* Performance work on task switch and I/O latency.
We would like to gratefully acknowledge funding support from [Intel][] and [relationalAI][]

Member commented:

Did you mean to put this in this section and delete this content?



## Acknowledgements

Member commented:

Suggested change
[here]: https://github.com/JuliaLang/julia/pull/31086
[Intel]: https://www.intel.com/
[relationalAI]: http://relational.ai/

An "official" version will appear in a later release, to give us time to settle
on an API we can commit to for the long term.
Here's what you need to know if you want to upgrade your code over this period.

Member commented:

Suggested change
- Managing [Task scheduling and synchronization](#Task-scheduling-and-synchronization)
- Managing [Thread-local state](#Thread-local-state)
- Effect on [Random number generation](#Random-Number-Generation)



Using it is not recommended, since it is hard to predict how much stack
space will be needed, for instance by the compiler or called libraries.

A thread can switch to running a given task simply (in principle) by switching
Member commented:

that’s not really true on any platform

Suggested change
A thread can switch to running a given task simply (in principle) by switching
A thread can switch to running a given task by adjusting the registers to appear to “return from” the previous task switch. We allocate a new stack out of a local pool just before we start running it.


We also have an alternate implementation of stack switching (controlled by the
`ALWAYS_COPY_STACKS` variable in `options.h`) that trades time for memory by
copying live stack data when a task switch occurs.
Member commented:

Suggested change
copying live stack data when a task switch occurs.
copying live stack data when a task switch occurs.
This may not be compatible with foreign code that uses `cfunction`,
so it is not the default.

scheduler.
In particular, we need to make sure no other thread sees that task and thinks
"oh, there's a task I can run", causing it to scribble on the scheduler's
stack.
Member commented:

Suggested change
stack.

Here are some of the points we hope to focus on to further develop
our threading capabilities:

* Performance work on task switch and I/O latency.
* More performant parallel loops and reductions, with more scheduling options.
* Allow adding more threads at run time.
* Improved debugging tools.
* Explore API extensions, e.g. cancel points.
Member commented:

I hope never :P

* Allow adding more threads at run time.
* Improved debugging tools.
* Explore API extensions, e.g. cancel points.
* Thread-safe data structures.
Member commented:

Suggested change
* Thread-safe data structures.
* Standard-library of thread-safe data structures for user code.


We are also grateful to the several people who patiently tried this functionality
while it was in development and filed bug reports or pull requests, and spurred us
to keep going!
Member commented:

Suggested change
to keep going!
to keep going! We know there are remaining problems, and we will appreciate you letting us know about your experience with it through the GitHub and Discourse channels!

@JeffBezanson JeffBezanson merged commit 14be227 into master Jul 23, 2019
@delete-merged-branch delete-merged-branch bot deleted the jb/mtblog branch July 23, 2019 15:26
@tknopp (Contributor) commented Jul 25, 2019:

@JeffBezanson: Awesome work! Regarding the blog post I missed the origin of the threading effort, which was this PR JuliaLang/julia#6741

I don't want to overrate my work but the prototype was pretty functional and I had the impression that it was an important step towards serious multi-threading. There is also a publication outlining the concepts behind that effort: https://ieeexplore.ieee.org/document/7069898
