
Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option. #38

Merged
merged 1 commit from amitm/mpmd into master on Mar 10, 2015

Conversation

amitmurthy
Contributor

...to have all processes be part of both the MPI and Julia clusters.

@amitmurthy
Contributor Author

@andreasnoack , @ViralBShah , @eschnett , @lcw

The requirement came up in the context of some folks trying to use Julia and MPI parallel constructs together on a Cray machine. Apparently TCP connections from the login node to the compute nodes are not allowed.

Actually, at this time we are not even sure whether TCP connections among processes on the compute nodes are allowed; if they are not, we will have to try using MPI itself for the Julia transport, but that is a different issue.

With this PR we will have 3 distinct ways of using MPI.jl.

  1. MPI only mode. All processes are separate and not part of a Julia cluster
  2. MPI only on workers. Julia cluster on master + workers.
  3. MPI and Julia on all processes. Uses mpirun's MPMD mode for startup - this PR.
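
For context on mode 2, here is a minimal sketch of how the existing cluster-manager path looks from the Julia master. This is an assumption based on MPI.jl's documented MPIManager API; the exact constructor keywords may differ between versions.

# Mode 2 sketch: the Julia master launches MPI ranks via the cluster manager.
# Assumption: MPI.jl exports MPIManager as described in its README.
using MPI

manager = MPIManager(np = 4)   # the manager will start 4 MPI ranks via mpirun
addprocs(manager)              # the ranks join the Julia cluster as workers

@everywhere using MPI          # each worker is also an MPI rank and can make MPI calls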

Any thoughts on this?

@eschnett
Contributor

eschnett commented Mar 2, 2015

You probably don't mean "login node", but rather the node where your job script runs (the one that calls mpirun); that's usually not one of the login nodes. Given that, do you still need this facility? Are ssh connections from the node that runs mpirun also disallowed? That would surprise me.

How does your mechanism work? I see that you pass different options to the master and the workers. Is the master running on the same nodes as the workers, i.e. are you starting one MPI process "too many"? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system?

@bcumming

bcumming commented Mar 2, 2015

By login node, we mean the nodes from which mpirun is executed (on Cray systems this is the aprun command). These service nodes are used to compile and launch jobs, and they are referred to as login nodes.

We also have external login nodes, which are configured to look like an internal login node but are physically separate systems, and which launch jobs on the internal login nodes.

In this context, they are referring to the internal login nodes.

@amitmurthy
Contributor Author

The terminology came from the folks using the Cray system. The equivalent of mpirun on the Cray is aprun - http://docs.cray.com/books/S-2496-4001/S-2496-4001.pdf

But, yes, the node where aprun will be executed. In this mode, all MPI processes are also part of the Julia cluster. MPI rank 0 is the Julia master (i.e. Julia pid 1). The other processes (started with the --worker argument) are both MPI processes and Julia workers.

We are still awaiting information about networking restrictions (if any) on the Cray.

@amitmurthy
Contributor Author

I think we commented at the same time.

@psanan

psanan commented Mar 2, 2015

As mentioned above, the biggest unknown is whether TCP/IP communication is possible between compute nodes on the Cray system. I'm having a difficult time googling this, but perhaps someone at the computing facility itself has a better idea.

These nodes, I believe, run "Compute Node Linux" (CNL).

Here's a 2008 paper that seems to claim that CNL supports TCP/IP communication in some sense:
https://cug.org/5-publications/proceedings_attendee_lists/2008CD/S08_Proceedings/pages/Authors/16-19Thursday/Pugmire-Thursday17C/Pugmire-Thursday17C-paper.pdf

Here are some 2013 slides from Cray which imply that TCP/IP can be used alongside the high speed Gemini/Aries networks:
https://cug.org/proceedings/cug2013_proceedings/includes/files/pap105-file2.pdf

@eschnett
Contributor

eschnett commented Mar 2, 2015

Thanks for the clarification. Could you also answer my second question?

"How does your mechanism work? I see that you pass different options to the master and the workers. Is the master running on the same nodes as the workers, i.e. are you starting one MPI process "too many"? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system?"

This is just out of curiosity. I completely agree that Julia's current startup mechanisms don't get along well with high-end HPC systems; in this respect, Crays are still easier to use than BG/Qs.

@amitmurthy
Contributor Author

"Is the master running on the same nodes as the workers" - Yes

"i.e. are you starting one MPI process 'too many'? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system" - They are the same in number

mpirun -np 1 julia foo-master.jl : -np 4 julia --worker=custom foo-worker.jl will launch a total of 5 processes. The comm world size is 5; in Julia, nprocs() would return 5 and nworkers() would return 4.

The process running foo-master.jl would have MPI rank 0 and Julia pid 1.
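
To make the mechanism concrete, here is a minimal, hypothetical sketch of what foo-master.jl could do once this launch succeeds. The cluster setup performed by MPI.jl and the --worker=custom startup is omitted, and the remotecall_fetch argument order is the Julia 0.4-era one used elsewhere in this thread.

# Hypothetical body of foo-master.jl after the MPI-launched cluster is up.
println(nprocs())     # 5: matches the MPI comm world size
println(nworkers())   # 4: every non-zero MPI rank is also a Julia worker

# Ordinary Julia parallel constructs now run across the MPI-launched processes:
a = ones(10^3)
b = remotecall_fetch(2, x -> 2x, a)   # round-trip to the worker with Julia pid 2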

@eschnett
Contributor

eschnett commented Mar 2, 2015

@amitmurthy In your example, did you request 4 or 5 MPI processes from the queuing system? That is, are you over-allocating the cores?

@psanan

psanan commented Mar 2, 2015

In that example you would need to request 5 cores from the queuing system. On the system we happen to be interested in, which supports hyperthreading, it is possible for these to be virtual cores (two per physical core), so with enough work I think it would be possible to double up the master process with a worker on one of the physical cores. That would probably be premature optimization in many cases, of course.

@eschnett
Contributor

eschnett commented Mar 2, 2015

I like the idea of having several execution models like this. I expect that many people will request the feature you implemented. Having said this, I'm unfortunately not (yet?) familiar enough with the cluster manager to review the patch itself.

@amitmurthy amitmurthy changed the title WIP/RFC: Using mpirun's MPMD mode of execution.... RFC: Using mpirun's MPMD mode of execution.... Mar 3, 2015
@amitmurthy amitmurthy changed the title RFC: Using mpirun's MPMD mode of execution.... Using mpirun's MPMD mode of execution.... Mar 3, 2015
@psanan

psanan commented Mar 3, 2015

From another exchange with the supercomputer center, we have clarified things a little wrt the availability of TCP/IP on the target system:

  • TCP/IP is not available in normal circumstances between the compute nodes. These run CNL ("Compute Node Linux"), which lacks a socket API.
  • The hardware does support TCP/IP, but the only feasible way to use it is to employ Cray's CCM ("Cluster Compatibility Mode"). This would likely severely degrade performance and defeat the point of trying to run on the Cray machine, as using this mode means all kinds of sacrifices, crucially (I believe) not using Cray's optimized interconnect at all (even for MPI).

An avenue that still might be worth pursuing is to implement a custom transport. It was suggested that one possibility for doing this would be to use CCI (https://www.olcf.ornl.gov/center-projects/common-communication-interface/), which can sit on top of Cray's GNI (Aries) and would also perhaps make code that we develop in this context reusable across any backends that CCI supports in the future.

@amitmurthy, if this path continues to seem attractive, does it make sense to take the MPI ClusterManager implementation (with the modifications in this PR) and use a new custom transport?

Another piece of information that would be useful is how efficient a custom transport layer would need to be. In particular, if we are relying on the MPI implementation to do the heavy lifting, would relatively inefficient communication over TCP/IP for the remaining traffic be performance-limiting? I suspect the answer depends on the granularity and synchronization requirements of the parallel algorithm, so I'll enquire further to that end.

@amitmurthy
Contributor Author

Some performance numbers for TCP/IP communication in Julia vis-a-vis MPI are being discussed here - JuliaLang/julia#9992

That said, we can try out an implementation that uses MPI itself for the Julia cluster setup too. I should have an implementation for this in a few days, and we should be able to get some performance numbers - at least on local machines.

A cursory reading of the CCI link makes me think that CCI is currently under development and may not be widely used. Am I correct? In that sense, having the transport work with the MPI interface seems like a better bet.

@lcw
Member

lcw commented Mar 4, 2015

Unfortunately I have not had the time to review the second and third ways of using MPI.jl, so I don't have any comments on this front. I think we should at least support

  1. MPI only mode. All processes are separate and not part of a Julia cluster.
  2. MPI + Julia cluster in some way (I am not sure what the right way is).

As far as the communication layer goes, it is not uncommon that on many large-scale clusters the only portable way of using the fast interconnect is through MPI. So I would avoid POSIX sockets or any other type of networking if we are targeting computers on the Top 500.

@psanan

psanan commented Mar 4, 2015

Using MPI itself for all communication is of course an attractive prospect, as Cray has put a lot of work into its own MPI implementation (based on MPICH), relying on its interconnect interface (GNI). It would be great to test out! (I don't understand the demands of the communication layer well enough to know how arduous a development task this is, though.)

I also got the impression that CCI is under active development - I'll be visiting the center in a few days, hopefully, so will try to ask for more info.

@eschnett
Contributor

eschnett commented Mar 4, 2015

I was hoping that the new cluster manager framework in Julia had a way to replace sockets with other functionality. If this is possible, then sending data via MPI would be straightforward. We can create a new communicator (so there would be no conflict), and there are already MPI functions to send arbitrary (serializable) data without much fuss. One can then either have one waiting receiver thread per other process, or one thread that waits for a message from any other process.

Last time I looked, this seemed quite possible. The only difficulty was the startup process, since a cluster manager wants to start the remote processes via ssh, whereas MPI relies on mpirun and cannot interactively add new processes.
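
A minimal sketch of the send side of this idea, not the PR's actual code. It assumes MPI.jl wrappers that mirror the C API (Comm_dup, Isend); the exact signatures vary across MPI.jl versions, and the snippet is written in post-1.0 syntax.

# Sketch: ship an arbitrary serializable Julia value over a private communicator.
using MPI, Serialization

MPI.Init()
const JULIA_COMM = MPI.Comm_dup(MPI.COMM_WORLD)   # dedicated communicator, no tag clashes

function send_obj(obj, dest::Integer, tag::Integer)
    io = IOBuffer()
    serialize(io, obj)                  # any serializable value -> bytes
    buf = take!(io)                     # Vector{UInt8}
    MPI.Isend(buf, dest, tag, JULIA_COMM)
end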

@andreasnoack
Member

@eschnett What about MPI_SPAWN?

@eschnett
Contributor

eschnett commented Mar 5, 2015

@andreasnoack: On an HPC system, a job receives an allocation with a certain number of cores. You can start additional MPI processes, but these will have to share the same cores that you are already using.

The above is the (policy) limitation to which I was referring; at the technical level, as you say, MPI offers intercommunicators to allow increasing the number of processes.

@psanan

psanan commented Mar 5, 2015

@eschnett, I was wondering about that point: how would the threading be achieved? In our case, hyperthreading is available, so running a receiver thread per worker MPI process might be quite efficient.

@amitmurthy amitmurthy changed the title Using mpirun's MPMD mode of execution.... WIP: Using mpirun's MPMD mode of execution. MPI as a transport option. Mar 5, 2015
@amitmurthy amitmurthy force-pushed the amitm/mpmd branch 2 times, most recently from 02f21c5 to e019a35 on March 5, 2015
@amitmurthy
Contributor Author

Current status of MPI as transport:

Good news: it works!
Not so good news:

  • The speed of data transfer is around 2 times slower than TCP for small arrays, and 10-20 times slower for large arrays. The benchmark used:
nloops = 10^3
function foo(n)
    a = ones(n)
    remotecall_fetch(2, x->x, a)       # warm up the connection and compile paths

    @elapsed for i in 1:nloops
        remotecall_fetch(2, x->x, a)   # round-trip an n-element array to worker 2
    end
end

n = 10^3
foo(1)                                 # warm-up run
t = foo(n)
println("$t seconds for $nloops loops of send-recv of array size $n")

n = 10^6
foo(1)
t = foo(n)
println("$t seconds for $nloops loops of send-recv of array size $n")

MPI Transport:

0.18172557 seconds for 1000 loops of send-recv of array size 1000
38.380218969 seconds for 1000 loops of send-recv of array size 1000000

TCP Transport:

0.091442312 seconds for 1000 loops of send-recv of array size 1000
2.568867051 seconds for 1000 loops of send-recv of array size 1000000

I should be able to fix the large array timing issue, but I don't think I can make the MPI timings faster than TCP.

  • The second issue is that the integration of MPI calls with libuv is a bit problematic. Julia being single-threaded, if a process is waiting on an MPI.recv, nothing else will be executing in that process. The current implementation does an inefficient busy-wait around MPI.irecv with a yield (see the sketch after this list). I don't know how to work around this until we have multi-threading in Julia. The effect of this is that processes currently consume 100% CPU even when they are just waiting on computation/communication.
  • The third (minor) issue is the printing of some warning messages upon exit. This is because the shutdown process of the cluster requires the communication layer to be available, but once we call MPI_Finalize() the comm layer is effectively unavailable.
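
For readers following along, a minimal sketch of the Iprobe/yield busy-wait described in the second item above, not the PR's actual implementation. It assumes C-API-style MPI.jl wrappers (Iprobe, Get_count, Get_source, Get_tag, Recv!); the exact signatures vary across MPI.jl versions, and the snippet is in post-1.0 syntax.

# Busy-wait receive loop: poll for a message from any rank, yield to other tasks otherwise.
using MPI

function recv_loop(comm, handler)
    while true
        flag, status = MPI.Iprobe(MPI.ANY_SOURCE, MPI.ANY_TAG, comm)
        if flag
            nbytes = MPI.Get_count(status, UInt8)
            buf = Vector{UInt8}(undef, nbytes)
            MPI.Recv!(buf, MPI.Get_source(status), MPI.Get_tag(status), comm)
            handler(buf)    # hand the serialized bytes to Julia's message processing
        else
            yield()         # lets other tasks run, but keeps the core at 100% when idle
        end
    end
end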

@psanan

psanan commented Mar 5, 2015

Awesome!

I can run the new example out of the box on my laptop:

[examples (amitm/mpmd=)]$ mpirun -np 1 $JULIA 07-mpi-master.jl : -np 4 $JULIA --worker=custom 07-mpi-worker.jl
0.414936563 seconds for 1000 loops of send-recv of array size 1000
58.243897587 seconds for 1000 loops of send-recv of array size 1000000
EXAMPLE: HELLO
Hello world, I am 0 of 5
Hello world, I am 4 of 5
....

I attempted to run on the Cray machine as well, and have run into a couple of limitations of aprun's MPMD mode:

  • Apparently MPMD mode with aprun can only load one executable per compute node. This is annoying since it means that an entire compute node (24 cores here) must be devoted to the master process, but there are many compute nodes on the machine, so losing most of one is probably acceptable.
  • System commands cannot be used with aprun's MPMD mode. This is preventing me from testing further at the moment, because we have been wrapping the julia executable in a shell script as a workaround for an unlocated lib/julia/sys.ji, but I suspect there is a way around that.
aprun.x: Apid 913941: Commands are not supported in MPMD mode
aprun.x: Apid 913941: Exiting due to errors. Application aborted

I will need to rebuild julia to try and test this further (since I'm using a version preceding the addition of --worker=custom).

@eschnett
Contributor

eschnett commented Mar 5, 2015

@psanan: If you have real threading (which Julia doesn't have yet), then running one MPI thread per network card should suffice, i.e. one MPI worker thread per node. Which configuration is most efficient (reserving a core, running on a hyperthread, having one additional thread per node, etc.) is probably system-dependent.

@amitmurthy: It is likely that MPI requires a "progress thread" to handle sends and receives. It is also the case that most MPI implementations are not thread-safe, so one cannot call MPI_Send from one thread while another is stuck in MPI_Recv; even when they are thread-safe, this may not be implemented efficiently. The frameworks I know use MPI_Iprobe and yield in a loop. Yes, this runs the core at 100% even if it is otherwise idle.

@amitmurthy: Why do you need different executables? You could run the same executable, and within the executable call MPI_Comm_rank to determine whether this is the root process or not.

@amitmurthy
Contributor Author

irecv does an Iprobe, so I guess the busy-wait is similar to other frameworks then. With Julia multi-threading, we could ensure that all MPI calls are made from a single MPI-communication-only thread, with an IPC mechanism between the MPI thread and the other threads; but since Julia multi-threading is quite a way out, Iprobe/yield will have to do till then.

We don't need different executables, but we do need different arguments. The Julia processes need to know whether they are running as a "master" or a "worker", and different command-line arguments specify this. Currently this needs to be done as part of process initialization (before any modules are loaded), and hence using the MPI rank is not feasible.

I'll see if I can submit a patch against Julia 0.4, which will make it possible for a Julia "master" process to become a "worker" after all initialization is complete.

@psanan

psanan commented Mar 6, 2015

Julia is rebuilt, so I can now run all the usual examples on the Cray machine, and woohoo, the MPI-only ClusterManager works!
(For anyone coming across this thread in the future, if you are having problems where sys.ji is not found, this can be remedied by setting JULIA_HOME to /path/to/julia/root/usr/bin in the environment).

Here we run on two nodes:

$ aprun -n 1 ~/julia-dev/julia ~/.julia/v0.4/MPI/examples/07-mpi-master.jl : -n 24 ~/julia-dev/julia --worker=custom ~/.julia/v0.4/MPI/examples/07-mpi-worker.jl
0.577344311 seconds for 1000 loops of send-recv of array size 1000
31.549359867 seconds for 1000 loops of send-recv of array size 1000000
EXAMPLE: HELLO
Hello world, I am 0 of 25
Hello world, I am 24 of 25
Hello world, I am 18 of 25Hello world, I am 17 of 25Hello world, I am 20 of 25
....

@amitmurthy
Contributor Author

Awesome! Once JuliaLang/julia#10419 is merged, we will not need MPMD and will be able to use all cores on all nodes.

I'll put in a fix for speeding up data transfer times over the next couple of days.

@psanan

psanan commented Mar 6, 2015

Tested on three compute nodes as well:

$ aprun -n 1 ~/julia-dev/julia ~/.julia/v0.4/MPI/examples/07-mpi-master.jl : -n 48 ~/julia-dev/julia --worker=custom ~/.julia/v0.4/MPI/examples/07-mpi-worker.jl
0.628255388 seconds for 1000 loops of send-recv of array size 1000
31.685103334 seconds for 1000 loops of send-recv of array size 1000000
EXAMPLE: HELLO
Hello world, I am 0 of 49
....

@amitmurthy
Contributor Author

OK. MPMD is no longer required, and MPI-only transport is now around 2-3 times slower than TCP. I'll work on removing this difference.

You can try it out once JuliaLang/julia#10453 is merged.

@eschnett
Contributor

eschnett commented Mar 9, 2015

My naive guess would be that the serialization in MPI creates overhead. If you already know that you are sending e.g. an array of characters, then you can skip the serialization call and instead just call MPI_Isend.

Also -- are you using an irecv that waits for ANY_SOURCE, or are you using one irecv per other process?
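
A tiny, hypothetical sketch of the serialization-skipping suggestion above, not part of the PR (same C-API-style MPI.jl wrapper assumption as the earlier sketches):

# Fast path: a flat bits-type array can be handed to Isend directly,
# with no serialize/deserialize round-trip (the receiver must post a matching typed Recv!).
using MPI

send_array(a::Array{Float64}, dest, tag, comm) = MPI.Isend(a, dest, tag, comm)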

@amitmurthy
Contributor Author

These are the timings for Julia parallel constructs like pmap, @parallel, remotecall*, etc., using MPI only for message transport instead of TCP/IP sockets.

Regular MPI code will be faster.

So, the serialization overhead is on the Julia side - the request gets packed with a small identifier, and the serialization format embeds the types, etc. The MPI function used for sending the serialized data is MPI_Isend.

irecv is used with ANY_SOURCE

@eschnett
Contributor

eschnett commented Mar 9, 2015

This serialization overhead would then be the same for both TCP/IP and MPI, unless the MPI implementation does something wrong.

The timings graph doesn't show the units, and doesn't say what it shows. Naively, I'd assume it shows seconds (since it is called "timings"), and then MPI is already faster than TCP/IP. It also makes sense that shared memory ("sm") is faster than using the loopback device. If MPI is implemented as a user-space library (it probably is) and TCP/IP needs to make a kernel round-trip, then this also explains the performance difference.

@amitmurthy amitmurthy changed the title WIP: Using mpirun's MPMD mode of execution. MPI as a transport option. Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option. Mar 10, 2015
@amitmurthy
Contributor Author

Merging this for now. Will open separate issues to track further improvements.

amitmurthy added a commit that referenced this pull request Mar 10, 2015
Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option.
@amitmurthy amitmurthy merged commit 0682251 into master Mar 10, 2015
amitmurthy referenced this pull request in JuliaLang/julia Mar 10, 2015
remove argument for option --worker , export init_worker
@simonbyrne simonbyrne deleted the amitm/mpmd branch May 3, 2019 00:02