
Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option. #38

Merged
merged 1 commit from amitm/mpmd into master on Mar 10, 2015

Conversation

amitmurthy
Contributor

...to have all processes be part of both the MPI and Julia clusters.

@amitmurthy
Contributor Author

@andreasnoack , @ViralBShah , @eschnett , @lcw

The requirement came up in the context of some folks trying to use Julia and MPI parallel constructs together on a Cray machine. Apparently TCP connections from the login node to the compute nodes are not allowed.

Actually, at this time we are not even sure whether TCP connections among processes on the compute nodes are allowed; if they are not, we will have to try using MPI itself for the Julia transport, but that is a different issue.

With this PR we will have 3 distinct ways of using MPI.jl.

  1. MPI only mode. All processes are separate and not part of a Julia cluster
  2. MPI only on workers. Julia cluster on master + workers.
  3. MPI and Julia on all processes. Uses mpirun's MPMD mode for startup - this PR.
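
For context on mode 2, here is a minimal sketch of how the existing cluster-manager path looks from the Julia master. This is an assumption based on MPI.jl's documented MPIManager API; the exact constructor keywords may differ between versions.

# Mode 2 sketch: the Julia master launches MPI ranks via the cluster manager.
# Assumption: MPI.jl exports MPIManager as described in its README.
using MPI

manager = MPIManager(np = 4)   # the manager will start 4 MPI ranks via mpirun
addprocs(manager)              # the ranks join the Julia cluster as workers

@everywhere using MPI          # each worker is also an MPI rank and can make MPI calls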

Any thoughts on this?

@eschnett
Contributor

eschnett commented Mar 2, 2015

You probably don't mean "login node", but rather the node where your job script runs (the one that calls mpirun); that's usually not one of the login nodes. Given that, do you still need this facility? Are ssh connections from the node that runs mpirun also disallowed? That would surprise me.

How does your mechanism work? I see that you pass different options to the master and the workers. Is the master running on the same nodes as the workers, i.e. are you starting one MPI process "too many"? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system?

@bcumming

bcumming commented Mar 2, 2015

By login node, we mean the nodes from which mpirun is executed (on Cray systems this is the aprun command). These service nodes are used to compile and launch jobs, and they are referred to as login nodes.

We also have external login nodes, which are configured to look like an internal login node but are physically separate systems, and which launch jobs on the internal login nodes.

In this context, they are referring to the internal login nodes.

@amitmurthy
Contributor Author

The terminology came from the folks using the Cray system. The equivalent of mpirun on the Cray is aprun - http://docs.cray.com/books/S-2496-4001/S-2496-4001.pdf

But, yes, the node where aprun will be executed. In this mode, all MPI processes are also part of the Julia cluster. MPI rank 0 is the Julia master (i.e. Julia pid 1). The other processes (started with the --worker argument) are both MPI processes and Julia workers.

We are still awaiting information about networking restrictions (if any) on the Cray.

@amitmurthy
Contributor Author

I think we commented at the same time.

@psanan

psanan commented Mar 2, 2015

As mentioned above, the biggest unknown is whether TCP/IP communication is possible between compute nodes on the Cray system. I'm having a difficult time googling this, but perhaps someone at the computing facility itself has a better idea.

These nodes, I believe, run "Compute Node Linux" (CNL).

Here's a 2008 paper that seems to claim that CNL supports TCP/IP communication in some sense:
https://cug.org/5-publications/proceedings_attendee_lists/2008CD/S08_Proceedings/pages/Authors/16-19Thursday/Pugmire-Thursday17C/Pugmire-Thursday17C-paper.pdf

Here are some 2013 slides from Cray which imply that TCP/IP can be used alongside the high speed Gemini/Aries networks:
https://cug.org/proceedings/cug2013_proceedings/includes/files/pap105-file2.pdf

@eschnett
Contributor

eschnett commented Mar 2, 2015

Thanks for the clarification. Could you also answer my second question?

"How does your mechanism work? I see that you pass different options to the master and the workers. Is the master running on the same nodes as the workers, i.e. are you starting one MPI process "too many"? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system?"

This is just out of curiosity. I completely agree that Julia's current startup mechanisms don't get along well with high-end HPC systems; in this respect, Crays are still easier to use than BG/Qs.

@amitmurthy
Contributor Author

"Is the master running on the same nodes as the workers" - Yes

"i.e. are you starting one MPI process 'too many'? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system" - They are the same in number

mpirun -np 1 julia foo-master.jl : -np 4 julia --worker=custom foo-worker.jl will launch a total of 5 processes. The comm world size is 5; in Julia, nprocs() would return 5 and nworkers() would return 4.

The process running foo-master.jl would have MPI rank 0 and Julia pid 1.
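
To make the mechanism concrete, here is a minimal, hypothetical sketch of what foo-master.jl could do once this launch succeeds. The cluster setup performed by MPI.jl and the --worker=custom startup is omitted, and the remotecall_fetch argument order is the Julia 0.4-era one used elsewhere in this thread.

# Hypothetical body of foo-master.jl after the MPI-launched cluster is up.
println(nprocs())     # 5: matches the MPI comm world size
println(nworkers())   # 4: every non-zero MPI rank is also a Julia worker

# Ordinary Julia parallel constructs now run across the MPI-launched processes:
a = ones(10^3)
b = remotecall_fetch(2, x -> 2x, a)   # round-trip to the worker with Julia pid 2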

@eschnett
Contributor

eschnett commented Mar 2, 2015

@amitmurthy In your example, did you request 4 or 5 MPI processes from the queuing system? That is, are you over-allocating the cores?

@psanan

psanan commented Mar 2, 2015

In that example you would need to request 5 cores from the queuing system. On the system we happen to be interested in, which supports hyperthreading, it is possible for these to be virtual cores (two per physical core), so with enough work I think it would be possible to double up the master process with a worker on one of the physical cores. That would probably be premature optimization in many cases, of course.

@eschnett
Contributor

eschnett commented Mar 2, 2015

I like the idea of having several execution models like this. I expect that many people will request the feature you implemented. Having said this, I'm unfortunately not (yet?) familiar enough with the cluster manager to review the patch itself.

@amitmurthy amitmurthy changed the title WIP/RFC: Using mpirun's MPMD mode of execution.... RFC: Using mpirun's MPMD mode of execution.... Mar 3, 2015
@amitmurthy amitmurthy changed the title RFC: Using mpirun's MPMD mode of execution.... Using mpirun's MPMD mode of execution.... Mar 3, 2015
@psanan

psanan commented Mar 3, 2015

From another exchange with the supercomputer center, we have clarified things a little wrt the availability of TCP/IP on the target system:

  • TCP/IP is not available in normal circumstances between the compute nodes. These run CNL ("Compute Node Linux"), which lacks a socket API.
  • The hardware does support TCP/IP, but the only feasible way to use it is to employ Cray's CCM ("Cluster Compatibility Mode"). This would likely severely degrade performance and defeat the point of trying to run on the Cray machine, as using this mode means all kinds of sacrifices, crucially (I believe) not using Cray's optimized interconnect at all (even for MPI).

An avenue that still might be worth pursuing is to implement a custom transport. It was suggested that one possibility for doing this would be to use CCI (https://www.olcf.ornl.gov/center-projects/common-communication-interface/), which can sit on top of Cray's GNI (Aries) and would also perhaps make code that we develop in this context reusable across any backends that CCI supports in the future.

@amitmurthy, if this path continues to seem attractive, does it make sense to take the MPI ClusterManager implementation (with the modifications in this PR) and use a new custom transport?

Another piece of information that would be useful is how efficient a custom transport layer would need to be. In particular, if we are relying on the MPI implementation to do the heavy lifting, would relatively inefficient communication over TCP/IP for the remaining traffic be performance-limiting? I suspect the answer depends on the granularity and synchronization requirements of the parallel algorithm, so I'll enquire further to that end.

@amitmurthy
Contributor Author

Some performance numbers for TCP/IP communication in Julia vis-a-vis MPI are being discussed here - JuliaLang/julia#9992

That said, we can try out an implementation that uses MPI itself for the Julia cluster setup too. I should have an implementation for this in a few days, and we should be able to get some performance numbers - at least on local machines.

A cursory reading of the CCI link makes me think that CCI is currently under development and may not be widely used. Am I correct? In that sense, having the transport work with the MPI interface seems like a better bet.

@lcw
Member

lcw commented Mar 4, 2015

Unfortunately I have not had the time to review the second and third ways of using MPI.jl, so I don't have any comments on this front. I think we should at least support

  1. MPI only mode. All processes are separate and not part of a Julia cluster.
  2. MPI + Julia cluster in some way (I am not sure what the right way is).

As far as the communication layer goes, it is not uncommon that on many large-scale clusters the only portable way of using the fast interconnect is through MPI. So I would avoid POSIX sockets or any other type of networking if we are targeting computers on the Top 500.

@psanan

psanan commented Mar 4, 2015

Using MPI itself for all communication is of course an attractive prospect, as Cray has put a lot of work into its own MPI implementation (based on MPICH), relying on its interconnect interface (GNI). It would be great to test out! (I don't understand the demands of the communication layer well enough to know how arduous a development task this is, though.)

I also got the impression that CCI is under active development - I'll be visiting the center in a few days, hopefully, so will try to ask for more info.

@eschnett
Contributor

eschnett commented Mar 4, 2015

I was hoping that the new cluster manager framework in Julia had a way to replace sockets with other functionality. If this is possible, then sending data via MPI would be straightforward. We can create a new communicator (so there would be no conflict), and there are already MPI functions to send arbitrary (serializable) data without much fuss. One can then either have one waiting receiver thread per other process, or one thread that waits for a message from any other process.

Last time I looked, this seemed quite possible. The only difficulty was the startup process, since a cluster manager wants to start the remote processes via ssh, whereas MPI relies on mpirun and cannot interactively add new processes.
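
A minimal sketch of the send side of this idea, not the PR's actual code. It assumes MPI.jl wrappers that mirror the C API (Comm_dup, Isend); the exact signatures vary across MPI.jl versions, and the snippet is written in post-1.0 syntax.

# Sketch: ship an arbitrary serializable Julia value over a private communicator.
using MPI, Serialization

MPI.Init()
const JULIA_COMM = MPI.Comm_dup(MPI.COMM_WORLD)   # dedicated communicator, no tag clashes

function send_obj(obj, dest::Integer, tag::Integer)
    io = IOBuffer()
    serialize(io, obj)                  # any serializable value -> bytes
    buf = take!(io)                     # Vector{UInt8}
    MPI.Isend(buf, dest, tag, JULIA_COMM)
end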

@andreasnoack
Member

@eschnett What about MPI_SPAWN?

@eschnett
Contributor

eschnett commented Mar 5, 2015

@andreasnoack: On an HPC system, a job receives an allocation with a certain number of cores. You can start additional MPI processes, but these will have to share the same cores that you are already using.

The above is the (policy) limitation to which I was referring; at the technical level, as you say, MPI offers intercommunicators to allow increasing the number of processes.

@psanan

psanan commented Mar 5, 2015

@eschnett, I was wondering about that point: how would the threading be achieved? In our case, hyperthreading is available, so running a receiver thread per worker MPI process might be quite efficient.

@amitmurthy amitmurthy changed the title Using mpirun's MPMD mode of execution.... WIP: Using mpirun's MPMD mode of execution. MPI as a transport option. Mar 5, 2015
@amitmurthy amitmurthy force-pushed the amitm/mpmd branch 2 times, most recently from 02f21c5 to e019a35 on March 5, 2015
@amitmurthy
Contributor Author

Current status of MPI as transport:

Good news: it works!
Not so good news:

  • The speed of data transfer is around 2 times slower than TCP for small arrays, and 10-20 times slower for large arrays. The benchmark used:
nloops = 10^3
function foo(n)
    a = ones(n)
    remotecall_fetch(2, x->x, a)       # warm up the connection and compile paths

    @elapsed for i in 1:nloops
        remotecall_fetch(2, x->x, a)   # round-trip an n-element array to worker 2
    end
end

n = 10^3
foo(1)                                 # warm-up run
t = foo(n)
println("$t seconds for $nloops loops of send-recv of array size $n")

n = 10^6
foo(1)
t = foo(n)
println("$t seconds for $nloops loops of send-recv of array size $n")

MPI Transport:

0.18172557 seconds for 1000 loops of send-recv of array size 1000
38.380218969 seconds for 1000 loops of send-recv of array size 1000000

TCP Transport:

0.091442312 seconds for 1000 loops of send-recv of array size 1000
2.568867051 seconds for 1000 loops of send-recv of array size 1000000

I should be able to fix the large array timing issue, but I don't think I can make the MPI timings faster than TCP.

  • The second issue is that the integration of MPI calls with libuv is a bit problematic. Julia being single-threaded, if a process is waiting on an MPI.recv, nothing else will be executing in that process. The current implementation does an inefficient busy-wait around MPI.irecv with a yield (see the sketch after this list). I don't know how to work around this until we have multi-threading in Julia. The effect of this is that processes currently consume 100% CPU even when they are just waiting on computation/communication.
  • The third (minor) issue is the printing of some warning messages upon exit. This is because the shutdown process of the cluster requires the communication layer to be available, but once we call MPI_Finalize() the comm layer is effectively unavailable.
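
For readers following along, a minimal sketch of the Iprobe/yield busy-wait described in the second item above, not the PR's actual implementation. It assumes C-API-style MPI.jl wrappers (Iprobe, Get_count, Get_source, Get_tag, Recv!); the exact signatures vary across MPI.jl versions, and the snippet is in post-1.0 syntax.

# Busy-wait receive loop: poll for a message from any rank, yield to other tasks otherwise.
using MPI

function recv_loop(comm, handler)
    while true
        flag, status = MPI.Iprobe(MPI.ANY_SOURCE, MPI.ANY_TAG, comm)
        if flag
            nbytes = MPI.Get_count(status, UInt8)
            buf = Vector{UInt8}(undef, nbytes)
            MPI.Recv!(buf, MPI.Get_source(status), MPI.Get_tag(status), comm)
            handler(buf)    # hand the serialized bytes to Julia's message processing
        else
            yield()         # lets other tasks run, but keeps the core at 100% when idle
        end
    end
end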

@psanan

psanan commented Mar 5, 2015

Awesome!

I can run the new example out of the box on my laptop:

[examples (amitm/mpmd=)]$ mpirun -np 1 $JULIA 07-mpi-master.jl : -np 4 $JULIA --worker=custom 07-mpi-worker.jl
0.414936563 seconds for 1000 loops of send-recv of array size 1000
58.243897587 seconds for 1000 loops of send-recv of array size 1000000
EXAMPLE: HELLO
Hello world, I am 0 of 5
Hello world, I am 4 of 5
....

I attempted to run on the Cray machine as well, and have run into a couple of limitations of aprun's MPMD mode:

  • Apparently MPMD mode with aprun can only load one executable per compute node. This is annoying since it means that an entire compute node (24 cores here) must be devoted to the master process, but there are many compute nodes on the machine, so losing most of one is probably acceptable.
  • System commands cannot be used with aprun's MPMD mode. This is preventing me from testing further at the moment, because we have been wrapping the julia executable in a shell script as a workaround for an unlocated lib/julia/sys.ji, but I suspect there is a way around that.
aprun.x: Apid 913941: Commands are not supported in MPMD mode
aprun.x: Apid 913941: Exiting due to errors. Application aborted

I will need to rebuild julia to try and test this further (since I'm using a version preceding the addition of --worker=custom).

@eschnett
Contributor

eschnett commented Mar 5, 2015

@psanan: If you have real threading (which Julia doesn't have yet), then running one MPI thread per network card should suffice, i.e. one MPI worker thread per node. Which configuration is most efficient (reserving a core, running on a hyperthread, having one additional thread per node, etc.) is probably system-dependent.

@amitmurthy: It is likely that MPI requires a "progress thread" to handle sends and receives. It is also the case that most MPI implementations are not thread-safe, so one cannot call MPI_Send from one thread while another is stuck in MPI_Recv; even when they are thread-safe, this may not be implemented efficiently. The frameworks I know use MPI_Iprobe and yield in a loop. Yes, this runs the core at 100% even if it is otherwise idle.

@amitmurthy: Why do you need different executables? You could run the same executable, and within the executable call MPI_Comm_rank to determine whether this is the root process or not.

@amitmurthy
Contributor Author

irecv does an Iprobe, so I guess the busy-wait is similar to other frameworks then. With Julia multi-threading, we could ensure that all MPI calls are made from a single MPI-communication-only thread, with an IPC mechanism between the MPI thread and the other threads; but since Julia multi-threading is quite a way out, Iprobe/yield will have to do till then.

We don't need different executables, but we do need different arguments. The Julia processes need to know whether they are running as a "master" or a "worker", and different command-line arguments specify this. Currently this needs to be done as part of process initialization (before any modules are loaded), and hence using the MPI rank is not feasible.

I'll see if I can submit a patch against Julia 0.4, which will make it possible for a Julia "master" process to become a "worker" after all initialization is complete.

@psanan

psanan commented Mar 6, 2015

Julia is rebuilt, so I can now run all the usual examples on the Cray machine, and woohoo, the MPI-only ClusterManager works!
(For anyone coming across this thread in the future, if you are having problems where sys.ji is not found, this can be remedied by setting JULIA_HOME to /path/to/julia/root/usr/bin in the environment).

Here we run on two nodes:

$ aprun -n 1 ~/julia-dev/julia ~/.julia/v0.4/MPI/examples/07-mpi-master.jl : -n 24 ~/julia-dev/julia --worker=custom ~/.julia/v0.4/MPI/examples/07-mpi-worker.jl
0.577344311 seconds for 1000 loops of send-recv of array size 1000
31.549359867 seconds for 1000 loops of send-recv of array size 1000000
EXAMPLE: HELLO
Hello world, I am 0 of 25
Hello world, I am 24 of 25
Hello world, I am 18 of 25Hello world, I am 17 of 25Hello world, I am 20 of 25
....

@amitmurthy
Contributor Author

Awesome! Once JuliaLang/julia#10419 is merged, we will not need MPMD and will be able to use all cores on all nodes.

I'll put in a fix for speeding up data transfer times over the next couple of days.

@psanan

psanan commented Mar 6, 2015

Tested on three compute nodes as well:

$ aprun -n 1 ~/julia-dev/julia ~/.julia/v0.4/MPI/examples/07-mpi-master.jl : -n 48 ~/julia-dev/julia --worker=custom ~/.julia/v0.4/MPI/examples/07-mpi-worker.jl
0.628255388 seconds for 1000 loops of send-recv of array size 1000
31.685103334 seconds for 1000 loops of send-recv of array size 1000000
EXAMPLE: HELLO
Hello world, I am 0 of 49
....

@amitmurthy
Contributor Author

OK. MPMD is no longer required, and MPI-only transport is now around 2-3 times slower than TCP. I'll work on removing this difference.

You can try it out once JuliaLang/julia#10453 is merged.

@eschnett
Contributor

eschnett commented Mar 9, 2015

My naive guess would be that the serialization in MPI creates overhead. If you already know that you are sending e.g. an array of characters, then you can skip the serialization call and instead just call MPI_Isend.

Also -- are you using an irecv that waits for ANY_SOURCE, or are you using one irecv per other process?
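
A tiny, hypothetical sketch of the serialization-skipping suggestion above, not part of the PR (same C-API-style MPI.jl wrapper assumption as the earlier sketches):

# Fast path: a flat bits-type array can be handed to Isend directly,
# with no serialize/deserialize round-trip (the receiver must post a matching typed Recv!).
using MPI

send_array(a::Array{Float64}, dest, tag, comm) = MPI.Isend(a, dest, tag, comm)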

@amitmurthy
Contributor Author

These are the timings for Julia parallel constructs like pmap, @parallel, remotecall*, etc., using MPI only for message transport instead of TCP/IP sockets.

Regular MPI code will be faster.

So, the serialization overhead is on the Julia side - the request gets packed with a small identifier, and the serialization format embeds the types, etc. The MPI function used for sending the serialized data is MPI_Isend.

irecv is used with ANY_SOURCE

@eschnett
Contributor

eschnett commented Mar 9, 2015

This serialization overhead would then be the same for both TCP/IP and MPI, unless the MPI implementation does something wrong.

The timings graph doesn't show the units, and doesn't say what it shows. Naively, I'd assume it shows seconds (since it is called "timings"), and then MPI is already faster than TCP/IP. It also makes sense that shared memory ("sm") is faster than using the loopback device. If MPI is implemented as a user-space library (it probably is) and TCP/IP needs to make a kernel round-trip, then this also explains the performance difference.

@amitmurthy amitmurthy changed the title WIP: Using mpirun's MPMD mode of execution. MPI as a transport option. Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option. Mar 10, 2015
@amitmurthy
Contributor Author

Merging this for now. Will open separate issues to track further improvements.

amitmurthy added a commit that referenced this pull request Mar 10, 2015
Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option.
@amitmurthy amitmurthy merged commit 0682251 into master Mar 10, 2015
amitmurthy referenced this pull request in JuliaLang/julia Mar 10, 2015
remove argument for option --worker , export init_worker
@simonbyrne simonbyrne deleted the amitm/mpmd branch May 3, 2019 00:02