I’m developing an algorithm using Distributed.jl and DistributedArrays.jl, and I’m supposed to run this algorithm on a cluster supercomputer. I’ve recently been told that it uses Infiniband to connect its internal nodes with each other. This is my first time doing parallel computing, and I’d never heard of Infiniband before, so I wanted to ask if I can keep using DistributedArrays.jl or if I have to switch to MPI.jl (which I would avoid, if possible). Sorry if this question is a bit nooby and vague, but I’m a bit lost. Searching online gave different answers, and in general it’s not something that is talked about much online. And as I’ve said, it’s my first time dealing with this kind of thing.
I’m in a similar situation: I’m starting to look at deploying our simulations in clusters with several nodes connected by infiniband, so I look forward to any response from people with this kind of experience.
(As far as I understand, Infiniband is much like good old ethernet, just much faster, so you can reduce the bottleneck of internode communication.)
Possibly relevant: Running Julia in a SLURM Cluster - #2 by CameronBieganek
It looks like you’ll be a bit of a pioneer: Custom transport for Distributed.jl to utilize Infiniband and avoid MPI?
UCX.jl seems to be the leading choice here, but it hasn’t gotten a new release in over a year, although several PRs appear to be pending. cc @vchuravy
If this is your first time doing parallel computing, it’s probably easiest to bite the bullet and use MPI rather than acting as a test dummy for a new protocol.
Perhaps you could also consider PartitionedArrays.jl which uses MPI as a backend.
That’s a pity. Looking online I did find people using Julia with IPoIB, which is why I asked if it was possible. Then again, knowing nothing (yet) about the cluster I’m about to work on, I’m not sure if IPoIB even applies.
You can use MPIClusterManagers.jl to run Distributed.jl over MPI.
My long-term plan is to build something akin to Distributed.jl using UCX as the communication layer.
But there is no ETA on that 
Why do you need to use MPI? I’d say try with Distributed.jl and DistributedArrays.jl and use a cluster manager (say, ClusterManagers.jl). I am assuming your cluster has some workload management software already installed (e.g., Slurm, PBS).
My cluster is also connected with Infiniband, but to me that is just some backend magic to connect the nodes together. Instead, my workflow is simply to log in to my head node, start Julia, and use ClusterManagers to connect to the remaining nodes.
That is what I thought too, but I wasn’t sure if using InfiniBand would have been a problem. I’ve never used a ClusterManager tho, I wouldn’t even know which one to choose.
Do you know what you have installed on your cluster? Your path to parallelization could be as simple as
using Distributed, ClusterManagers
addprocs(SlurmManager(n))   # n = number of worker processes to request
results = pmap(1:100) do i
    # code here, using i
end
which will distribute the code in the do block over n worker processes. The pmap function will manage scheduling and allocation, and return your results in an array for you.
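If it helps to try the pattern before you have cluster access, here is a minimal sketch that runs on a laptop. It assumes two local workers started with plain addprocs(2) in place of the SlurmManager call, and slow_square is a made-up stand-in for the real workload.

```julia
using Distributed

# Assumption: two local workers stand in for cluster workers; on a Slurm
# cluster this line would be the addprocs(SlurmManager(n)) call instead.
addprocs(2)

# Functions called inside pmap must be defined on every worker.
@everywhere slow_square(i) = (sleep(0.01); i^2)   # hypothetical workload

# pmap hands items to whichever worker is free and returns the results
# in the same order as the input.
results = pmap(slow_square, 1:100)

# Any reduction over the results happens on the master process.
total = sum(results)
println("total = ", total)

# Shut the workers down again.
rmprocs(workers())
```

Only the addprocs line should need to change when moving to the cluster; the pmap call itself stays the same.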
Note I am talking about embarrassingly parallel problems. Multithreading is a whole different beast.
I still have to have access to the cluster. I’ll be sure to be back here once I know more. It’s good to know that I don’t necessarily need to use MPI.jl. I wanted to know if I had to rework my algorithm from scratch using MPI or not. Thank you.
This pointer is really useful @affans, thanks! But I cannot give you more than one “heart” 
Distributed.jl and DistributedArrays.jl are not the only possibility if you do not want to program using MPI.jl directly. There are high-level packages that build on MPI.jl and make distributed parallelization easier, e.g. ImplicitGlobalGrid.jl.
Of course, high-level packages often have a particular focus, and not every one of them will suit your use case, which you have not mentioned yet.
A Julia application that uses MPI.jl - whether directly or via a high-level package - can be submitted on a cluster pretty much the same way as any other MPI application, that might be written in C/C++/Fortran/… See here for an example on how you can run Julia applications, e.g., on the Piz Daint supercomputer at the Swiss National Supercomputing Centre:
https://user.cscs.ch/tools/interactive/julia/#how-to-run-on-piz-daint
In the case of embarrassingly parallel problems isn’t it better to have SLURM (or whatever) launch several julia processes?
On a cluster, small jobs usually have higher priority, so it is usually better to launch many small jobs (each one starting Julia and then working on a piece of the 1:100) than one big job (which starts Julia and then works on the whole 1:100).
This may not be the optimal workflow. In particular, Distributed doesn’t offer a tree-based reduce, which might make mapreduce operations (i.e. distributed for loops) seriously slow compared to MPI. It’s really hard to compare with MPI if one needs to transfer large data across cores.
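A rough sketch of the "many small jobs" approach, assuming the jobs are submitted as a Slurm job array with indices starting at 1 (e.g. --array=1-10). Each job runs the same script and picks its own piece of 1:100 from the SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT environment variables that Slurm sets for array jobs. The helper chunk_for_task is a made-up name for illustration.

```julia
# Split 1:total into ntasks nearly equal contiguous chunks and return the
# chunk belonging to the 1-based task index.
function chunk_for_task(total::Int, ntasks::Int, task::Int)
    1 <= task <= ntasks || throw(ArgumentError("task must be in 1:$ntasks"))
    base, extra = divrem(total, ntasks)
    # The first `extra` tasks each get one additional item.
    start = (task - 1) * base + min(task - 1, extra) + 1
    len = base + (task <= extra ? 1 : 0)
    return start:(start + len - 1)
end

# Slurm sets these inside a job array; the defaults let the script also
# run outside Slurm as a single task covering the whole range.
task = parse(Int, get(ENV, "SLURM_ARRAY_TASK_ID", "1"))
ntasks = parse(Int, get(ENV, "SLURM_ARRAY_TASK_COUNT", "1"))

my_items = chunk_for_task(100, ntasks, task)
println("task $task of $ntasks handles items $my_items")
```

Each job would then write its results to its own file, and a final small job (or a local script) collects them.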
A lot of people use Slurm batch files to launch Julia processes, but I am a big fan of doing everything within Julia using ClusterManagers (which internally builds an srun command). My addprocs for example looks like
addprocs(SlurmManager(500), N=17, topology=:master_worker, exeflags="--project=.")
launching 500 worker processes across 17 nodes, after which I just use pmap to launch my long-running functions on each of the processes.
I see, I don’t know much about tree-based reduce. For my work, pmap returns an array of values from each of the worker processes and I do my own reduce operations on it.
What does tree-based reduce mean?
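Roughly: instead of every worker sending its partial result to the master, which then combines them one after another, the workers combine their results in pairs, round by round. With n workers that takes about log2(n) rounds rather than n - 1 sequential steps on a single process. The sketch below only shows the combination pattern, run serially in plain Julia; it is not how MPI implements it internally, and tree_reduce is a made-up name.

```julia
# Combine partial results pairwise, round by round, as a tree reduction does.
# Serial sketch: in a real distributed run each pair would be combined on a
# different worker, so all pairs within a round could proceed in parallel.
function tree_reduce(op, parts::AbstractVector)
    isempty(parts) && throw(ArgumentError("nothing to reduce"))
    vals = collect(parts)
    rounds = 0
    while length(vals) > 1
        next = eltype(vals)[]
        for i in 1:2:length(vals)
            # An unpaired last element is carried over unchanged.
            push!(next, i < length(vals) ? op(vals[i], vals[i + 1]) : vals[i])
        end
        vals = next
        rounds += 1
    end
    return vals[1], rounds
end

partials = collect(1:16)   # stand-ins for the partial sums of 16 workers
total, rounds = tree_reduce(+, partials)
println("total = $total after $rounds rounds")
```

For 16 partial results this needs 4 rounds, whereas collecting everything on the master means 15 additions (and 16 incoming transfers) on one process.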
Since we’re talking about use cases, I have to construct a 3D or 4D lattice of arbitrary side length, and divide it into chunks to be processed by each worker. Each point is processed knowing its current value and the values of its nearest neighbors. In fact, in my implementation I don’t use pmap; I launch every process once and @sync them. I do this for a bunch of loops, making sure that the points each process is working on are totally independent and don’t require locking the array. The only exchange between processes is when neighbors happen to be on a part of the lattice owned by another process, which DistributedArrays.jl handles automatically for me simply by indexing the global array object.
Hi @ultrapoci, I ran into a similar problem. I use iterative solvers on a Slurm cluster with DistributedArrays and I have to communicate data between workers every iteration, which is really slow because there is no InfiniBand support (10 to 200 times slower than in-node data transfer). Did you find any way to avoid it? Or do we have to rework it with MPI.jl?
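For readers wondering how updates can be made independent: one common way is a checkerboard (even/odd) split, sketched below in plain serial Julia. This is an assumption for illustration only; the post does not say which scheme is actually used, and the periodic boundaries, the averaging update rule, and the names neighbor_sum and sweep_parity! are all made up.

```julia
# Sum of the 2N nearest neighbors of site I on a periodic N-dimensional lattice.
function neighbor_sum(lattice::AbstractArray{T,N}, I::CartesianIndex{N}) where {T,N}
    s = zero(T)
    for d in 1:N, step in (-1, 1)
        # mod1 wraps the index around the boundary (periodic lattice assumed).
        J = CartesianIndex(ntuple(k -> k == d ? mod1(I[k] + step, size(lattice, k)) : I[k], N))
        s += lattice[J]
    end
    return s
end

# Update only the sites whose coordinate sum has the given parity (0 or 1).
# With even side lengths, every neighbor of such a site has the opposite
# parity, so sites of one parity never read each other and can be updated
# in any order, or on different workers, without locking.
function sweep_parity!(lattice, parity::Int)
    for I in CartesianIndices(lattice)
        if mod(sum(Tuple(I)), 2) == parity
            # Hypothetical update rule: average of the site and its neighbors.
            lattice[I] = (lattice[I] + neighbor_sum(lattice, I)) / (2 * ndims(lattice) + 1)
        end
    end
    return lattice
end

lattice = ones(4, 4, 4)      # hypothetical 3D lattice with even side length
sweep_parity!(lattice, 0)    # first update all "even" sites
sweep_parity!(lattice, 1)    # then all "odd" sites
```

In a distributed version each worker would run the same two half-sweeps on its own chunk, and only the sites on chunk boundaries would need values owned by another worker.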
Sadly, I’ve found no solution. I decided to port the program to Rust. Despite not having cluster support (yet), I can simply run each simulation by itself on each node manually. It’s a bit tedious, but it gets the job done in a fraction of the time Julia would take. I still use Julia from time to time for data analysis, the REPL is quite handy for this.
Sadly, I didn’t know about the InfiniBand issue until the whole program was already working. I only found the communication bottleneck recently, while optimizing the code, and then the cluster administrator asked me whether my code uses InfiniBand to send messages or not. It’s too late