optimized send - direct writes for large bitstype arrays #10073

Merged (1 commit merged into JuliaLang:master on Feb 9, 2015)
Conversation

amitmurthy
Contributor

Whenever a request exceeds 100K in serialized length, it is directly written to the socket.

Supersedes #6876; partially addresses #9992.
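
A minimal sketch of that idea, assuming a simple threshold check at send time (hypothetical names such as DIRECT_WRITE_THRESHOLD and send_sketch, written in current Julia syntax rather than the 0.4-era code this PR targeted):

```julia
using Serialization

# ~100K cutoff described above; the constant name is made up for this sketch.
const DIRECT_WRITE_THRESHOLD = 100 * 1024

# Serialize the message, then either aggregate it with other small messages
# in a shared send buffer or push it straight out to the socket.
function send_sketch(sock::IO, sendbuf::IOBuffer, msg)
    iob = IOBuffer()
    serialize(iob, msg)
    payload = take!(iob)
    if length(payload) > DIRECT_WRITE_THRESHOLD
        # Large request: flush whatever has been aggregated so far, then
        # write the serialized bytes directly to the socket.
        write(sock, take!(sendbuf))
        write(sock, payload)
    else
        # Small request: append to the aggregation buffer, to be flushed
        # together with other pending messages later.
        write(sendbuf, payload)
    end
    return nothing
end
```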

@amitmurthy
Contributor Author

@andreasnoack :

Current master:

julia> a=ones(10^7);

julia> println(mean([@elapsed remotecall_fetch(2, (x)->x, a) for i in 1:10]))
0.1583344692

julia> s=[randstring() for x in 1:10^5];

julia> println(mean([@elapsed remotecall_fetch(2, (x)->x, s) for i in 1:10]))
0.2919245722

julia> println(@elapsed [remotecall_fetch(2, myid) for i in 1:10000])
0.618292734

With this PR:

julia> a=ones(10^7);

julia> println(mean([@elapsed remotecall_fetch(2, (x)->x, a) for i in 1:10]))
0.0611091225

julia> s=[randstring() for x in 1:10^5];

julia> println(mean([@elapsed remotecall_fetch(2, (x)->x, s) for i in 1:10]))
0.1801866141

julia> println(@elapsed [remotecall_fetch(2, myid) for i in 1:10000])
0.602078664

@ViralBShah added the parallelism (Parallel or distributed computation) label on Feb 5, 2015
@amitmurthy
Contributor Author

@andreasnoack - could you please test this out? I would like to merge it soon.
@vtjnash / @Keno - could you review the code if possible?

@andreasnoack
Member

Yes. Sorry for not getting back to you on this one. I'll run the benchmarks tomorrow and report back.

@andreasnoack
Member

Here are the results, and they confirm your findings. This is really great. First, a plot of the time it takes to move an array of Float64s to a worker, against the size of the array:
[plot: transfer time vs. array size (amitpull1)]

and next, the timings relative to MPI with network transport (MPI-TCP):
[plot: timings relative to MPI-TCP (amitpull2)]

So instead of being four times slower than MPI, we'll now be two times slower (for large arrays).
cc: @alanedelman

amitmurthy added a commit that referenced this pull request on Feb 9, 2015:
optimized send - direct writes for large bitstype arrays
@amitmurthy merged commit 6558327 into JuliaLang:master on Feb 9, 2015
@ViralBShah
Member

Do we know where we are losing the 2x now?

If we can make this work as well as MPI-tcp, that may lead to reconsidering the darray design. Of course, we would also need efficient implementation of collectives at some basic level.

@amitmurthy
Contributor Author

Not yet, but we will keep chipping away. The IOBuffer used to aggregate messages of up to 100K in length can be replaced by a fixed-length array, which will also remove unnecessary allocations for small requests.

Another TODO is optimizing @everywhere, where we currently serialize the same request multiple times (and send it multiple times, too).
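
As a rough illustration of the fixed-length-buffer idea (hypothetical names, current Julia syntax, not code from this PR):

```julia
# Preallocate the aggregation buffer once and reuse it for every small
# request, instead of allocating per-message buffers.
mutable struct SendBuffer
    data::Vector{UInt8}   # fixed-length backing array
    n::Int                # number of valid bytes currently held
end

SendBuffer(limit::Integer) = SendBuffer(Vector{UInt8}(undef, limit), 0)

# Append serialized bytes in place (bounds/overflow handling omitted);
# the caller writes data[1:n] to the socket and resets n when flushing.
function appendbytes!(buf::SendBuffer, bytes::AbstractVector{UInt8})
    copyto!(buf.data, buf.n + 1, bytes, 1, length(bytes))
    buf.n += length(bytes)
    return buf
end
```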

@JeffBezanson
Member

This is great! Could you briefly explain the reason for introducing BufferedAsyncStream rather than adding this behavior to the existing AsyncStream types?

@amitmurthy
Contributor Author

We should add it to all existing AsyncStream types. In #6876 there was a discussion about having this support for all IO, adding a buffer to File, and getting rid of IOStream entirely, which I am not sure how to go about.

I figured it was best to redo this PR to solve just the one problem of sending large arrays in our parallel infrastructure while the other issues get sorted out.

I can submit a PR that moves the implementation to all AsyncStream types while exporting three new functions: lock, unlock, and set_lockable. The last is needed because, by default, we want AsyncStreams to retain the existing behavior, i.e., no buffering.
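
Intended usage of that proposed API might look roughly like the following (a sketch only: set_lockable and lock/unlock on streams do not exist as described here, and sock, header_bytes, and payload_bytes are placeholders):

```julia
set_lockable(sock, true)     # opt a stream into buffered, lockable writes
lock(sock)                   # begin aggregating writes in the stream's buffer
try
    write(sock, header_bytes)
    write(sock, payload_bytes)
finally
    unlock(sock)             # release the stream and flush the buffered bytes
end
```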
