They are based on somewhat different models of parallelism:
MPI.jl is based on a single-program multiple-data (SPMD) model, where (typically) all processes run the same program and periodically perform communication operations with the other processes.
MPI is a very well-established paradigm, so a lot of time and money has been spent optimizing these operations, both in terms of hardware (MPI is able to leverage fast interconnects such as InfiniBand) and software support (it ultimately underlies almost all HPC codebases, either directly or indirectly via libraries like gasp mentioned above). If you have a CUDA-aware MPI installation, MPI.jl can use CuArrays directly for communication, which lets it take advantage of that specialized hardware.
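To make the SPMD model concrete, here is a minimal sketch using MPI.jl's standard API (the workload itself is made up): every process runs the same script, computes its own piece, and a collective reduction combines the results.

```julia
# Minimal SPMD sketch: every process runs this same script.
# Launch with something like: mpiexec -n 4 julia spmd_sketch.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# Each process computes its own (made-up) piece of work
local_result = sum((rank * 10 + 1):(rank * 10 + 10))

# Collective operation: combine the per-process results on every rank
total = MPI.Allreduce(local_result, +, comm)

if rank == 0
    println("Total over $nprocs processes: $total")
end
```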
Downsides of MPI:
- It can be a pain to set up: we’ve put a lot of effort into making MPI.jl easy to install and use, and it should work out of the box on your local machine, but setting it up on a cluster can still take a bit of effort.
- Both sides of the communication typically need to know how much data is being sent (or, equivalently, you need a separate communication operation to communicate how much data will be sent). Despite this inflexibility, it turns out to be a reasonable approach for a lot of scientific problems, such as solving PDEs, where you are typically communicating arrays of floating-point numbers at regular intervals (see the sketch after this list).
- It is very much oriented toward batch jobs: MPI jobs have to be launched via a special program (typically called `mpiexec` or similar) that starts and connects all the processes. This makes it difficult to use interactively (which is one of the nice features of Julia).
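To illustrate the point about message sizes: in a point-to-point exchange the receiver allocates its buffer up front, so both sides need to agree on the length in advance. A small sketch, assuming two ranks and a recent MPI.jl with keyword-argument `Send`/`Recv!`:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

n = 1_000  # both ranks must agree on the message length ahead of time
if rank == 0
    data = rand(n)
    MPI.Send(data, comm; dest=1, tag=0)    # send n Float64s to rank 1
elseif rank == 1
    buf = Vector{Float64}(undef, n)        # receiver preallocates the space
    MPI.Recv!(buf, comm; source=0, tag=0)  # fill it with the incoming data
end
```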
Distributed.jl is based on remote procedure calls: you typically write programs with a main process which evaluates functions on worker processes. This is very flexible, and more complicated systems, such as Dagger.jl, can be built on top of it. It is very easy to set up, can be used interactively, and plays nicely with CUDA.jl.
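For comparison, a minimal Distributed.jl sketch (the worker count and the `work` function are just placeholders): the main process starts workers, defines a function everywhere, and farms out calls to it.

```julia
using Distributed

addprocs(4)  # start 4 worker processes on the local machine

# Define the function on the main process and on every worker
@everywhere work(n) = sum(abs2, rand(n))

# Remote procedure call: run `work` on worker 2 and fetch the result
r = remotecall_fetch(work, 2, 10_000)

# Higher-level: spread many calls across the workers and collect the results
results = pmap(work, fill(10_000, 20))
```

All of this can be typed into an ordinary REPL session, which is where the interactivity advantage shows up.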
The key downsides:
- By default, processes communicate via sockets, which can be slower and have higher latency than specialized interconnects. You can use MPI as the communication layer, but then you lose the interactivity.
- There is additional overhead from serializing data at each end.
My personal advice would generally be to start with Distributed.jl: if you’re only running on a single machine (even one with multiple GPUs), it will most likely be sufficient. If you want to scale to 1000s of nodes, then you will likely need MPI at some point (though I would love to be proven wrong). If you’re somewhere in the middle, it really depends on the specific problem you’re trying to solve and what hardware you have available.