Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option. #38
Conversation
@andreasnoack, @ViralBShah, @eschnett, @lcw: The requirement came up in the context of some folks trying to use Julia and MPI parallel constructs together on a Cray machine. Apparently TCP connections from the login node to the compute nodes are not allowed. Actually, at this time, we are not even sure if TCP connections among processes on the compute nodes are allowed - if they are not, we will have to try using MPI itself for the Julia transport, but that is a different issue. With this PR we will have 3 distinct ways of using MPI.jl.
Any thoughts on this?
You probably don't mean "login node", but rather the node where your job script runs (the one that calls mpirun). How does your mechanism work? I see that you pass different options to the master and the workers. Is the master running on the same nodes as the workers, i.e. are you starting one MPI process "too many"? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system?
By login node, we mean nodes from which mpirun is executed (on Cray systems these are the internal login nodes). We also have external login nodes, which are configured to look like an internal login node, but which are physically different systems, and which launch jobs on the internal login nodes. In this context, they are referring to the internal login nodes.
The terminology came from the folks using the Cray system. The equivalent of mpirun on the Cray is aprun - http://docs.cray.com/books/S-2496-4001/S-2496-4001.pdf - but, yes, the node where aprun will be executed. In this mode, all MPI processes are also part of the Julia cluster. MPI rank 0 is the Julia master (i.e. Julia pid 1). The other processes (started with the --worker argument) are both MPI processes and Julia workers. We are still awaiting information about networking restrictions (if any) on the Cray.
I think we commented at the same time.
As mentioned above, the biggest unknown is whether TCP/IP communication is possible between compute nodes on the Cray system. I'm having a difficult time googling this, but perhaps someone at the computing facility itself has a better idea. These nodes, I believe, run "Compute Node Linux" (CNL).
Here's a 2008 paper that seems to claim that CNL supports TCP/IP communication in some sense:
Here are some 2013 slides from Cray which imply that TCP/IP can be used alongside the high-speed Gemini/Aries networks:
Thanks for the clarification. Could you also answer my second question? "How does your mechanism work? I see that you pass different options to the master and the workers. Is the master running on the same nodes as the workers, i.e. are you starting one MPI process 'too many'? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system?" This is just out of curiosity. I completely agree that Julia's current startup mechanisms don't get along well with high-end HPC systems; in this respect, Crays are still easier to use than BG/Qs.
"Is the master running on the same nodes as the workers" - Yes "i.e. are you starting one MPI process 'too many'? Or are there one fewer Julia worker processes than MPI processes requested from the queuing system" - They are the same in number
The process running foo-master.jl would have MPI rank 0 and Julia pid 1.
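To make the layout concrete: a hypothetical MPMD launch for this example might look like mpirun -np 1 julia foo-master.jl : -np 4 julia --worker (the worker-side invocation is an assumption; only foo-master.jl is named above), with aprun taking the place of mpirun on the Cray. Once the cluster is up, a quick way to check the rank/pid mapping is a sketch along these lines (not code from this PR):

```julia
# Sketch only: check the MPI-rank/Julia-pid mapping described above.
# Assumes the cluster is already up (rank 0 = pid 1, workers connected),
# that MPI is initialized on every process, and that MPI.jl is installed.
# On Julia >= 0.7, `using Distributed` is also needed for workers()/@fetchfrom.
using MPI
@everywhere using MPI

# The master (Julia pid 1) is expected to be MPI rank 0.
println("pid 1 -> MPI rank ", MPI.Comm_rank(MPI.COMM_WORLD))

# Each worker reports its own MPI rank.
for w in workers()
    r = @fetchfrom w MPI.Comm_rank(MPI.COMM_WORLD)
    println("pid ", w, " -> MPI rank ", r)
end
```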
@amitmurthy In your example, did you request 4 or 5 MPI processes from the queuing system? That is, are you over-allocating the cores?
In that example you would need to request 5 cores from the queuing system. On the system we happen to be interested in, which supports hyperthreading, it is possible for these to be virtual cores (two per physical core), so with enough work I think it would be possible to double up the master process with a worker on one of the physical cores. That would probably be premature optimization in many cases, of course.
I like the idea of having several execution models like this. I expect that many people will request the feature you implemented. Having said this, I'm unfortunately not (yet?) familiar enough with the cluster manager to review the patch itself.
From another exchange with the supercomputer center, we have clarified things a little wrt the availability of TCP/IP on the target system:
An avenue that still might be worth pursuing is to implement a custom transport. It was suggested that one possibility for doing this would be to use CCI (https://www.olcf.ornl.gov/center-projects/common-communication-interface/), which can sit on top of Cray's GNI (Aries) and would also perhaps make code that we develop in this context reusable across any backends that CCI supports in the future. @amitmurthy, if this path continues to seem attractive, does it make sense to take the MPI ClusterManager implementation (with the modifications in this PR) and pair it with a new custom transport? Another piece of information that would be useful is how efficient a custom transport layer would need to be. In particular, if we are relying on the MPI implementation to do the heavy lifting, would relatively inefficient communication over TCP/IP be performance-limiting? I suspect the answer depends on the granularity and synchronization requirements of the parallel algorithm, so I'll enquire further to that end.
Some performance numbers w.r.t. TCP/IP communication in Julia vis-a-vis MPI are being discussed here - JuliaLang/julia#9992. That said, we can try out an implementation that uses MPI itself for the Julia cluster setup too. I should have an implementation for this in a few days, and we should be able to get some performance numbers - at least on local machines. A cursory reading of the CCI link makes me think that CCI is currently under development and may not be widely used. Am I correct? In that sense, having the transport work with the MPI interface seems like a better bet.
Unfortunately I have not had the time to review the second and third ways of using MPI.jl proposed here.
As far as the communication layer goes, it is not uncommon on many of these systems for TCP/IP over the high-speed interconnect to perform much worse than the native MPI transport, so using MPI itself for communication is worth considering.
Using MPI itself for all communication is of course an attractive prospect, as Cray has put a lot of work into its own MPI implementation (based on MPICH), relying on its interconnect interface (GNI). It would be great to test out! (I don't understand the demands of the communication layer well enough to know how arduous a development task this is, though.) I also got the impression that CCI is under active development - I'll hopefully be visiting the center in a few days, so will try to ask for more info.
I was hoping that Julia's new cluster manager had a way to replace sockets with another transport. If this were possible, then sending data via MPI would be straightforward. We can create a new communicator (so there would be no conflict), and there are already MPI functions to send arbitrary (serializable) data without much fuss. One can then either have one waiting receiver thread per other process, or one thread that waits for a message from any other process. Last time I looked, this seemed quite possible. The only difficulty was the startup process, since a cluster manager wants to start the remote processes via ssh, whereas MPI relies on mpirun and cannot interactively add new processes.
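As a rough illustration of that design (not the code in this PR), the receive side could be a single loop on a duplicated communicator that accepts serialized objects from any rank. The MPI.Comm_dup / MPI.send / MPI.recv names and keyword signatures below are assumptions based on later MPI.jl releases; they have changed across versions.

```julia
# Rough sketch of an MPI-based message loop; NOT this PR's implementation.
# The send/recv keyword signatures are assumed from later MPI.jl releases.
using MPI

MPI.Init()
comm = MPI.Comm_dup(MPI.COMM_WORLD)   # private communicator, avoids tag clashes
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

if rank == 0
    # One receiver loop serving messages from any other rank.
    remaining = nranks - 1
    while remaining > 0
        msg = MPI.recv(comm; source = MPI.ANY_SOURCE, tag = 0)
        if msg == :done
            remaining -= 1
        else
            println("rank 0 received: ", msg)
        end
    end
else
    # Any serializable Julia object can be shipped this way.
    MPI.send((rank, "hello", rand(3)), comm; dest = 0, tag = 0)
    MPI.send(:done, comm; dest = 0, tag = 0)
end

MPI.Barrier(comm)
MPI.Finalize()
```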
@eschnett What about MPI_Comm_spawn?
@andreasnoack: On an HPC system, a job receives an allocation with a certain number of cores. You can start additional MPI processes, but these will have to share the same cores that you are already using. The above is the (policy) limitation to which I was referring; at the technical level, as you say, MPI offers intercommunicators to allow increasing the number of processes. |
@eschnett, I was wondering about that point: how would the threading be achieved? In our case, hyperthreading is available, so running a receiver thread per worker MPI process might be quite efficient.
Current status of MPI as transport: good news, it works!
MPI Transport:
TCP Transport:
I should be able to fix the large array timing issue, but I don't think I can make the MPI timings faster than TCP.
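For anyone wanting to reproduce this kind of comparison, a minimal round-trip timing over whichever transport the cluster was started with could look like the sketch below. It uses only plain Distributed calls (modern argument order; `using Distributed` is only needed on Julia >= 0.7) and is not the benchmark behind the timings above.

```julia
# Minimal round-trip timing over the active transport (TCP or MPI).
# Illustrative only; not the benchmark used for the timings above.
using Distributed      # not needed on Julia 0.4-era versions

w = first(workers())

# Small-message latency: many empty remote calls.
t_small = @elapsed for _ in 1:1000
    remotecall_fetch(() -> nothing, w)
end

# Large-message throughput: ship a ~8 MB Float64 array and get a scalar back.
a = rand(10^6)
t_large = @elapsed remotecall_fetch(sum, w, a)

println("1000 empty round trips: ", t_small, " s")
println("one 10^6-element array: ", t_large, " s")
```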
Awesome! I can run the new example out of the box on my laptop:
I attempted to run on the Cray machine as well, and have run into a couple of limitations of aprun's MPMD mode:
I will need to rebuild Julia to try and test this further (since I'm using a version preceding the addition of init_worker).
@psanan: If you have real threading (which Julia doesn't have yet), then running one MPI thread per network card should suffice, i.e. one MPI worker thread per node. Which configuration is most efficient -- reserving a core, running on a hyperthread, having one additional thread per node, etc. -- is probably system-dependent.
@amitmurthy: It is likely that MPI requires a "progress thread" to handle sends and receives. It is also the case that most MPI implementations are not thread-safe, so that one cannot call MPI functions from several threads at the same time.
@amitmurthy: Why do you need different executables? You could run the same executable, and within the executable call MPI_Comm_rank to decide whether the process should act as master or worker.
We don't need different executables, but we do need different arguments. The Julia processes need to know whether they are running as a "master" or a "worker", and different command-line arguments specify this. Currently this needs to be done as part of process initialization (before any modules are loaded), and hence using the MPI rank is not feasible. I'll see if I can submit a patch against Julia 0.4 which will make it possible for a Julia "master" process to become a "worker" after all initialization is complete.
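For illustration, this is roughly what such a patch would enable: a single executable for every rank, with the role chosen from the MPI rank after startup. The become_worker call is a placeholder for whatever entry point the patch ends up exposing, not an existing Julia or MPI.jl function.

```julia
# Hypothetical single-executable startup, only possible once a fully initialized
# Julia process can switch into worker mode. `become_worker` is a placeholder,
# not an existing function in Julia or MPI.jl.
using MPI

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

if rank == 0
    # Rank 0 stays the Julia master: set up the cluster manager, call addprocs, etc.
    include("driver.jl")     # hypothetical user driver script
else
    # All other ranks turn themselves into Julia workers and wait for the master.
    become_worker()          # placeholder for the patched-in entry point
end
```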
Julia is rebuilt, so I can now run all the usual examples on the Cray machine, and woohoo, the MPI-only ClusterManager works! Here we run on two nodes:
Awesome! Once JuliaLang/julia#10419 is merged, we will not need MPMD and will be able to use all cores on all nodes. I'll put in a fix for speeding up data transfer times over the next couple of days.
Tested on three compute nodes as well:
OK. MPMD is no longer required and the MPI-only transport is now around 2-3 times slower than TCP. I'll work on removing this difference. You can try it out once JuliaLang/julia#10453 is merged.
My naive guess would be that the serialization in MPI creates overhead. If you already know that you are sending e.g. an array of characters, then you can skip the serialization call and instead just call the MPI send routine directly. Also -- are you using an asynchronous (non-blocking) send?
These are the timings for Julia parallel constructs (functions like remotecall_fetch); regular MPI code will be faster. So, the serialization overhead is on the Julia side - the request gets packed with a small identifier and the serialization format embeds the types, etc. The MPI function for sending the serialized data is a non-blocking Isend.
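One way to get a feel for the Julia-side overhead being described (the request wrapper plus embedded type information) is to compare raw and serialized sizes; the effect is relatively much larger for small request-style messages than for big arrays. A small sketch, using only the standard serializer and not Julia's actual cluster wire format:

```julia
# Sketch: serialized size vs raw size, to get a feel for the overhead.
# This uses the plain serializer, not the exact format of Julia's cluster protocol.
using Serialization    # Julia >= 0.7; on older versions `serialize` is in Base

serialized_size(x) = (io = IOBuffer(); serialize(io, x); length(take!(io)))

small = (1, :some_call, 42)    # stand-in for a small request-style message
large = rand(10^6)             # ~8 MB Float64 payload

println("small message: serialized to ", serialized_size(small), " bytes")
println("large array:   ", sizeof(large), " raw bytes, serialized to ",
        serialized_size(large), " bytes")
```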
This serialization overhead would then be the same for both TCP/IP and MPI, unless the MPI implementation does something wrong. The timings graph doesn't show the units and doesn't say what it shows. Naively, I'd assume it shows seconds (since it is called "timings"), and then MPI is already faster than TCP/IP. It also makes sense that shared memory ("sm") is faster than using the loopback device. If MPI is implemented as a user-space library (it probably is) and TCP/IP needs to make a kernel round-trip, then this also explains the performance difference.
Merging this for now. Will open separate issues to track further improvements. |
Cluster Manager supports all processes as part of MPI. Also allows for MPI as a transport option.
remove argument for option --worker, export init_worker
...to have all processes be part of both the MPI and Julia clusters.