Distributed.jl vs MPI.jl

They are somewhat different models of parallelism:

MPI.jl is based on the single-program, multiple-data (SPMD) model, where (typically) all processes run the same program and periodically perform communication operations with other processes.

As a very well-established paradigm, lots of time and money have been spent optimizing these operations, both in terms of hardware (MPI is able to leverage fast interconnects such as InfiniBand) and software support (it ultimately underlies almost all HPC codebases, either directly or indirectly via libraries like Gasp.jl mentioned above). If you have a CUDA-aware MPI installation, MPI.jl is able to use CuArrays directly for communication (which can leverage specialized hardware).
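
To make the SPMD model concrete, here is a minimal sketch with MPI.jl (assuming a working MPI installation; launched with something like mpiexec -n 4 julia script.jl): every process runs the same script, computes a local partial result, and a collective Allreduce combines them.

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# every rank runs this same program and computes its own partial result
local_sum = sum(rank .+ (1:10))

# collective operation: every rank ends up with the global sum
total = MPI.Allreduce(local_sum, +, comm)
rank == 0 && println("total = ", total)

MPI.Finalize()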

Downsides of MPI:

  • It can be a pain to set up: we’ve put a lot of effort into making MPI.jl easy to install and use, and it should work out of the box on your local machine, but setting it up on a cluster can still take a bit of effort.
  • Both sides of the communication typically need to know how much data is being sent (or, equivalently, you need a separate communication operation to communicate how much data to send); see the sketch after this list. Despite this inflexibility, it turns out to be a reasonable approach for a lot of scientific problems, such as solving PDEs, where you are typically communicating arrays of floating-point numbers at regular intervals.
  • It is very much oriented toward batch jobs: MPI jobs have to be launched via a special program (typically called mpiexec or similar) that will start and connect all the processes. This makes it difficult to use interactively (which is one of the nice features of Julia).
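
To illustrate the fixed-size point above, a small point-to-point sketch (written against the keyword-argument Send/Recv! API of recent MPI.jl releases): the receiver has to preallocate a buffer of the agreed size before the message arrives.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

n = 1_000                              # both ranks agree on the message size up front
if rank == 0
    MPI.Send(rand(n), comm; dest=1, tag=0)
elseif rank == 1
    buf = Vector{Float64}(undef, n)    # receiver preallocates a buffer of matching size
    MPI.Recv!(buf, comm; source=0, tag=0)
end
MPI.Finalize()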

Distributed.jl is based on remote procedure calls: programs are typically written with a main process that evaluates functions on worker processes. This is very flexible, and more complicated systems can be built on top of it, such as Dagger.jl. It is very easy to set up, can be used interactively, and plays nicely with CUDA.jl.
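
A minimal sketch of that remote-call model (square is just an illustrative function):

using Distributed
addprocs(4)                            # start 4 worker processes on the local machine

@everywhere square(x) = x^2            # make the function available on every process

# remote procedure call: evaluate `square` on worker 2 and fetch the result
remotecall_fetch(square, 2, 10)        # returns 100

# or let pmap spread the work across all workers
results = pmap(square, 1:100)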

The key downsides:

  • By default, processes communicate via sockets, which can be slower and have higher latency than specialized interconnects. You can use MPI as a communication layer, but then you lose the interactivity.
  • There is additional overhead from serializing and deserializing data at each end.

My personal advice would generally be to start with Distributed.jl: if you’re only running on a single machine (even one with multiple GPUs), it will most likely be sufficient. If you want to scale to thousands of nodes, then you will likely need MPI at some point (though I would love to be proven wrong). If you’re somewhere in the middle, it really depends on the specific problem you’re trying to solve and what hardware you have available.


I am providing a link to a video, “JuliaCon 2018 | Parallel Computing with MPI-3 RMA and Julia | Bart Janssens”, which is not directly about Gasp.jl but is related to one-sided RMA: JuliaCon 2018 | Parallel Computing with MPI-3 RMA and Julia | Bart Janssens - YouTube. (Thanks for all the information - interesting thread.)


I think what that means is that the chunks on each worker are reduced locally, but reduction across workers seems to happen in serial:

function preduce(reducer, f, R)
    chunks = splitrange(Int(firstindex(R)), Int(lastindex(R)), nworkers())
    all_w = workers()[1:length(chunks)]

    w_exec = Task[]
    for (idx, pid) in enumerate(all_w)
        # each worker reduces its own chunk of the range locally
        t = Task(() -> remotecall_fetch(f, pid, reducer, R, first(chunks[idx]), last(chunks[idx])))
        schedule(t)
        push!(w_exec, t)
    end
    # the per-worker results are then combined serially on the calling process
    reduce(reducer, Any[fetch(t) for t in w_exec])
end

The final reduction happens serially, which has serious performance implications. I created the package ParallelUtilities.jl to get around this; it performs the reduction in parallel using binary trees. This performs better than distributed for loops, but is still no match for MPI.
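
The idea of a tree reduction can be sketched as follows (a minimal illustration, not the actual ParallelUtilities.jl implementation; it peeks at the internal Future field `where` to find the owning worker): each round combines pairs of worker-local results concurrently, so the number of partial results halves every round instead of being folded one at a time on process 1.

using Distributed

function tree_reduce(reducer, futures::Vector{<:Future})
    fs = collect(futures)
    while length(fs) > 1
        next = Future[]
        for i in 1:2:length(fs) - 1
            a, b = fs[i], fs[i + 1]
            # combine each pair on the worker holding `a`, so partial results
            # move between workers instead of all funnelling through process 1
            push!(next, remotecall(a.where, a, b) do x, y
                reducer(fetch(x), fetch(y))
            end)
        end
        isodd(length(fs)) && push!(next, fs[end])  # carry an unpaired result forward
        fs = next
    end
    return fetch(only(fs))
end

# e.g. tree_reduce(+, [remotecall(f, pid, 0) for pid in workers()])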

Running the following on a Slurm cluster using 2 nodes with 28 cores on each leads to:

julia> using Distributed

julia> using ParallelUtilities

julia> @everywhere f(x) = ones(10_000, 1_000);

julia> A = @time @distributed (+) for i=1:nworkers()
                f(i)
            end;
 22.637764 seconds (3.35 M allocations: 8.383 GiB, 16.50% gc time, 0.09% compilation time)

julia> B = @time pmapreduce(f, +, 1:nworkers());
  2.170926 seconds (20.47 k allocations: 77.117 MiB)

julia> A == B
true

In all fairness, distributed for loops, as documented, perform better when the arrays are small, but horribly when they are large. MPI is by far the fastest, given that it is highly optimized to make use of inter-node communication on clusters.


My understanding from their paper is that they run their codes several times to avoid exactly that issue.

Both this paper and one of their references (Benchmarking Julia’s Communication Performance: Is Julia HPC ready or Full HPC? | IEEE Conference Publication | IEEE Xplore) hint that there may be some issue with how communication between workers is handled, e.g., when comparing MPI.jl to its C equivalent. @simonbyrne Would you be able to comment on this?

Are you talking about the Allreduce times? That seems odd, especially since it was only on one system. Do you know if the code they used is available somewhere?

For the first paper, I don’t think so, as the authors are waiting for the paper to get accepted. For the second paper, here are the links to the repos mentioned in the appendix:
