Distributed.jl vs MPI.jl

Hi all,

I am relatively new to Julia, but I have ported one of my older programs from Fortran to Julia. I picked a program from my collection which doesn’t have any parallelism in it, with the intention of implementing parallelism from scratch, in a Julian way.

So here I am, trying to speed the code up in Julia. I know that there is the standard package Distributed.jl, which should allow me to do exactly what I want: parallelize the code in a Julian way. (Whether using a Julia standard package makes an approach Julian is a different question, which I will leave aside for the sake of brevity.) However, there is also MPI.jl, which seems to offer the same possibility as Distributed.jl, although I would see it as a step back, given that MPI as a library was needed by languages developed before parallel computers, unaware of the parallel environment in which they would eventually end up running. But Julia is not like that: Julia was designed with the inherent capability to run on multiple processing units.

So I wonder: what was the motivation to introduce MPI into Julia through a package, when the same or very similar functionality already exists in Distributed.jl? Has anyone tried both approaches, and what were your experiences? Does Distributed.jl lack some functionality present in MPI.jl? Does using MPI.jl bring more potential for better performance? Does either of these libraries render your code less platform independent? Since my long-term aim is to run simulations on multiple GPUs, is either of these packages easier to integrate with CUDA.jl? I am pretty sure that CUDA.jl is not aware of MPI, but it might be aware of Distributed.jl.

Any comments or opinions would be very welcome.

Cheers

3 Likes

Distributed.jl is the method to spawn new processes in separate memory spaces, either on your local machine, or any machine to which you can connect (by default via ssh). Data and code must be explicitly shared in order for it to run. So any .jl files or Julia packages need to be available to remote workers through their own filesystem.
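A minimal sketch of that model (assuming local workers; you would pass hostnames to `addprocs` to reach remote machines via ssh):

```julia
using Distributed

addprocs(2)  # spawn 2 local worker processes in separate memory spaces

# Code must be made available on every worker explicitly:
@everywhere f(x) = x^2

# pmap farms the work out to the workers and collects the results:
results = pmap(f, 1:10)
println(results)  # [1, 4, 9, ..., 100]
```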

Threads is for launching threads with shared memory, so you need to think about locking and races, but it saves copying data to workers.
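For example (a sketch; Julia must be started with multiple threads, e.g. `julia -t 4`, for this to actually run in parallel):

```julia
using Base.Threads

acc = Ref(0)
lk = ReentrantLock()

@threads for i in 1:1000
    # All threads share `acc`; the lock prevents a data race on the update.
    lock(lk) do
        acc[] += i
    end
end

println(acc[])  # 500500
```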

MPI.jl is just a wrapper around the C MPI library; it is not native Julia code.
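For a flavor of what that looks like, a minimal MPI.jl program mirrors its C counterpart (a sketch; it has to be launched externally, e.g. `mpiexec -n 4 julia hello.jl`):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)    # this process's rank, 0-based as in C
nprocs = MPI.Comm_size(comm)  # total number of ranks
println("Hello from rank $rank of $nprocs")
MPI.Finalize()
```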

So, if you want to go full Julia, Distributed & Threads is the way to go.

If you have a 1 GB dataset you want to process in parallel, using 4 Distributed processes will mean making 3 × 1 GB copies (plus 3× the memory used by the Julia runtime), then processing it and collating the results.

Using 4 threads means no copying, but also making sure they don’t stomp over each other’s data.

Distributed can use any Julia construct, so yes to CUDA. Just imagine that process ID 1 is you, and you’re running two REPL windows at the same time, one of which might be via ssh on a remote machine: that’s Distributed.jl.
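A sketch of that combination, assuming one GPU per worker process (the device assignment here is purely illustrative and requires a machine with enough visible GPUs):

```julia
using Distributed
addprocs(2)               # e.g. one process per GPU

@everywhere using CUDA

# Illustrative device assignment: pin worker i to GPU ordinal i-1.
for (i, pid) in enumerate(workers())
    remotecall_wait(pid, i - 1) do dev
        CUDA.device!(dev)
    end
end

# Each worker then computes on its own GPU independently:
futures = [remotecall(pid) do
               sum(CUDA.ones(Float32, 1024))
           end for pid in workers()]
@show fetch.(futures)
```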

1 Like

MPI.jl, as a wrapper, can use MPI collective communications that may be optimized for the interconnection network of a given cluster.

8 Likes

Unless someone proves it wrong, my rule of thumb for distributed computing with Julia is: Distributed.jl is nice and convenient for small(ish) parallelization, but if you really care about performance and want to scale things up to HPC cluster level, MPI.jl is the way to go.

(I’d be really curious to see a well thought out benchmark of Distributed.jl vs MPI.jl on cluster scale.)

4 Likes

Aaarggh… I had access to a 120-server cluster with Infiniband for the last few months… I could have run those benchmarks. We’re no longer in the testing phase, sorry.

I work for Dell; however, to request benchmarking time we would have to have some commercial justification for our bigger clusters.

FWIW, I have (privileged) access to clusters (with Omni-path / Infiniband) and could run a benchmark across O(100) nodes. However, we’d need a well thought-out benchmark. I just haven’t had the time to create one.

2 Likes

@carstenbauer Hi! How about a benchmark based on AlphaZero.jl? What would you think of such an idea? I am very new to HPC topics; I just thought that this is a very interesting package, closely related to topics at the forefront of scientific developments.

Just wondering: as for MPI, how does it look with regard to Gasp.jl? Do I have to rewrite the code of the package to be able to use it?

(Not sure that I have understood the question. )
It seems that Gasp.jl is already built upon MPI.

A Julia version of HPCG would make a lot of sense to the HPC community, but it would take some time to code…

2 Likes

I have a conceptual curiosity here: what can one distributed architecture (like MPI) offer that is much better than another (say Distributed.jl) for a massively parallel calculation? The effectiveness of such calculations depends heavily on structuring the data to avoid communication between tasks as much as possible, so I cannot imagine that the overhead of one message-passing interface or another would be critical for these applications (or at least it should matter in some limiting cases, but not in general).

(my personal experience using MD simulation packages - using MPI or charmrun, for example - is that unless the systems are huge enough such that communication is small relative to what each node has to compute, scaling is very poor whatever the interface used)

Not sure I follow, but scaling will eventually be limited by communication efficiency. For example, a dot product inside a large Krylov solver will probably be implemented with an MPI reduction, which can have a specific implementation on a given network (depending on the network topology, for instance).
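That kind of distributed dot product is a one-liner on top of the local computation (a sketch, to be launched under `mpiexec`; the vector lengths are arbitrary):

```julia
using MPI, LinearAlgebra

MPI.Init()
comm = MPI.COMM_WORLD

# Each rank holds its own chunk of the two distributed vectors:
x = rand(1_000)
y = rand(1_000)
local_dot = dot(x, y)

# The all-reduce is where a vendor MPI tuned to the network topology can help:
global_dot = MPI.Allreduce(local_dot, +, comm)

MPI.Finalize()
```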

It is quite possible that a lower-level network interface exists and is used by Charm++ and MPI (but this is only speculation, since I know nothing about it) and could also be used by DA.

1 Like

Right. Sorry, mental leap. Here is what I had in mind: for example, I have a package like AlphaZero.jl that is quite demanding. I have a machine with good network fabric and MPI-3 RMA. How do I use Gasp.jl? Do I have to rewrite the code of AlphaZero.jl in order to run it with Gasp.jl on several nodes? Is there any potential advantage to be expected when running it with Gasp.jl vs Distributed.jl?

Demanding on the communications (latency, bandwidth) ?
I can’t answer your question, but if the multi-node scaling of AlphaZero.jl is limited by network performance, I would make a simple test (again, a scalar product on two large arrays distributed across all nodes would be sufficient) using both approaches.

1 Like

Relevant to this discussion:
A Look at Communication-Intensive Performance in Julia.
They didn’t really look at Distributed.jl, but reported decent results for a Julia port of HPCG. The source code will apparently be made public, pending acceptance of their paper.

8 Likes

I’m looking forward to seeing the source code and running similar experiments on our clusters. I really wonder where these observed overheads for small problem sizes come from (and whether they still exist in Julia 1.7).

I wonder if it’s compilation: for N_* = 16 and 32 they report similar times (with 8 MPI nodes, N = 32 is faster than N = 16). The fact that Julia outperforms the C++ implementation at larger domain sizes is exciting, but it would be good to understand exactly why.

1 Like

In my experience Distributed.jl is terrible at parallel mapreduce if large arrays are communicated across workers on multiple nodes, as the reduction step happens serially on the master process. MPI is excellent there, as the reduction is in parallel.

2 Likes

I don’t think Distributed.jl’s mapreduce requires serial communication with the master process? See Distributed Computing · The Julia Language

The specified range is partitioned and locally executed across all workers. In case an optional reducer function is specified, @distributed performs local reductions on each worker with a final reduction on the calling process.
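The quoted behavior looks like this in practice (a sketch; each worker forms its own partial sum locally, and only those few partial results are combined on the calling process):

```julia
using Distributed
addprocs(2)

# Each worker reduces its share of the range with (+) locally;
# the calling process only combines the per-worker partial sums.
total = @distributed (+) for i in 1:1_000
    i^2
end
println(total)  # 333833500
```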

I am not an expert in this field, really; that’s why I took part in this discussion, hoping to get some guidance related to Gasp.jl and such an interesting package as AlphaZero.jl. However, if I had to provide an answer, I would say: in every field, with the exception of I/O. Please take a look at the video by Dr. Matt Bauman (the part about AlphaZero starts at 29:45) [https://youtu.be/80jBqGqKW6c]. It took him about 2 hours to train the agent, all 15 iterations, on 80 cores and 8 x V100s. I was hoping to beat him, possibly using Gasp.jl, as I somehow came by it. However, I do not fully understand this technology, and there seems to be no tutorial about it, or at least I was not able to find one.

During my own experiments (on single servers), the highest RAM utilization I noticed was about 320 GB; training of only 1 iteration (out of 15) took about 22 h on 2 x Platinum 8153. The shortest I have achieved on CPUs so far was about 8 h on 2 x Gold 6348. However, it should be noted that in this case the 1st part of the iteration (there are 4) was really pretty short, taking “only” about 1 h, but the next stages (2-4) ran on only 1 core, hence the long overall time. The training time for stage 1 on 2 x Gold 6348 was comparable to the time achieved on 2 x AMD EPYC 7543 with 2 x V100s (in this case only 1 GPU was utilized, at about 30%, with a GPU memory footprint of ~32 GB and a RAM footprint of ~20 GB). However, on GPUs, parts 2 to 4 were ultra fast compared to the CPU experiments.

Well, yeah, but how does one do it? The documentation on Gasp.jl seems rather advanced.

EDIT: In general, what I was wondering is: is Gasp.jl the right path to explore in relation to AlphaZero.jl?