Distributed.jl, DistributedArrays.jl with InfiniBand cluster

I’m developing an algorithm using Distributed.jl and DistributedArrays.jl, and I’m supposed to run it on a cluster supercomputer. I’ve recently been told that it uses InfiniBand to connect its nodes to each other. It’s my first time doing parallel computing, and I’d never heard of InfiniBand before, so I wanted to ask whether I can keep using DistributedArrays.jl or have to switch to MPI.jl (which I would rather avoid, if possible). Sorry if this question is a bit basic and vague, but I’m a bit lost. Searching online gave conflicting answers, and in general it’s not something that is discussed much. As I said, it’s my first time dealing with this kind of thing.

I’m in a similar situation: I’m starting to look at deploying our simulations on clusters with several nodes connected by InfiniBand, so I look forward to any response from people with this kind of experience.

(As far as I understand, InfiniBand is much like good old Ethernet, just much faster, so it reduces the bottleneck of internode communication.)

Possibly relevant: Running Julia in a SLURM Cluster - #2 by CameronBieganek

It looks like you’ll be a bit of a pioneer: Custom transport for Distributed.jl to utilize Infiniband and avoid MPI?

UCX.jl seems to be the leading choice here, but it hasn’t gotten a new release in over a year, although several PRs appear to be pending. cc @vchuravy

If this is your first time doing parallel computing, it’s probably easiest to bite the bullet and use MPI rather than acting as a test dummy for a new protocol.

Perhaps you could also consider PartitionedArrays.jl which uses MPI as a backend.

That’s a pity. Looking online I did find people using Julia with IPoIB, which is why I asked if it was possible. Then again, knowing nothing (yet) about the cluster I’m about to work on, I’m not sure if IPoIB even applies.

You can use MPIClusterManagers.jl to run Distributed.jl over MPI.

My long-term plan is to build something akin to Distributed.jl using UCX as the communication layer.
But there is no ETA on that :slight_smile:

Why do you need to use MPI? I’d say try with Distributed.jl and DistributedArrays.jl and use a cluster manager (say, ClusterManagers.jl). I am assuming your cluster has some workload management software already installed (e.g., Slurm, PBS).

My cluster is also connected with InfiniBand, but to me that is just some backend magic that connects the nodes together. My workflow is simply to log in to the head node, start Julia, and use ClusterManagers to connect to the remaining nodes.

That is what I thought too, but I wasn’t sure whether InfiniBand would be a problem. I’ve never used a cluster manager though, so I wouldn’t even know which one to choose.

Do you know what you have installed on your cluster? Your path to parallelization could be as simple as

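using Distributed, ClusterManagers
addprocs(SlurmManager(n))  # request n worker processes from Slurm
results = pmap(1:100) do i
    # code here, run once per i on whichever worker is free
end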

which will distribute the code in the do block over n worker processes. The pmap function manages scheduling and allocation, and returns your results in an array.
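For anyone who wants to try this pattern before getting cluster access, here is a minimal sketch. It assumes two local workers in place of `SlurmManager(n)`, and `slow_square` is a made-up stand-in for the real work.

```julia
using Distributed

# Assumption: two local workers stand in for the workers Slurm would launch.
addprocs(2)

# Hypothetical work function; @everywhere defines it on every process.
@everywhere slow_square(i) = (sleep(0.01); i^2)

# pmap hands out one element at a time to whichever worker is free
# and returns the results in input order.
results = pmap(slow_square, 1:100)
println(sum(results))  # 338350

# Shut the workers down again.
rmprocs(workers())
```

On the cluster, only the `addprocs` line should need to change; the rest is the same whether the workers are local or remote.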

Note I am talking about embarrassingly parallel problems. Multithreading is a whole different beast.

I don’t have access to the cluster yet. I’ll be sure to come back here once I know more. It’s good to know that I don’t necessarily need to use MPI.jl. I wanted to know whether I had to rework my algorithm from scratch using MPI or not. Thank you.

This pointer is really useful @affans, thanks! But I cannot give you more than one “heart” :smiley:

Distributed.jl and DistributedArrays.jl are not the only option if you do not want to program using MPI.jl directly. There are high-level packages that build on MPI.jl and make distributed parallelization easier, e.g. ImplicitGlobalGrid.jl.

High-level packages of course often have a particular focus, and not every one of them will suit your use case, which you have not described yet.

A Julia application that uses MPI.jl - whether directly or via a high-level package - can be submitted on a cluster pretty much the same way as any other MPI application, that might be written in C/C++/Fortran/… See here for an example on how you can run Julia applications, e.g., on the Piz Daint supercomputer at the Swiss National Supercomputing Centre:
https://user.cscs.ch/tools/interactive/julia/#how-to-run-on-piz-daint

In the case of embarrassingly parallel problems, isn’t it better to have Slurm (or whatever) launch several Julia processes?

On a cluster, small jobs usually have higher priority, so it is usually better to launch many small jobs (each one starting Julia and then working on a piece of the 1:100) than one big job (which starts Julia and then works on the whole 1:100).
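As a rough sketch of the many-small-jobs approach: assuming the script is submitted as a Slurm job array with task IDs starting at 1 (e.g. `--array=1-4`), each job can read `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` from its environment and pick its own piece of the work. The helper `my_chunk` is hypothetical, not part of any package.

```julia
# Return the contiguous piece of `items` that belongs to task `task_id` out of `ntasks`.
# Assumes task IDs run from 1 to ntasks.
function my_chunk(items, task_id, ntasks)
    n = length(items)
    lo = div((task_id - 1) * n, ntasks) + 1
    hi = div(task_id * n, ntasks)
    return items[lo:hi]
end

# The defaults let the script also run outside Slurm, as a single task.
task_id = parse(Int, get(ENV, "SLURM_ARRAY_TASK_ID", "1"))
ntasks = parse(Int, get(ENV, "SLURM_ARRAY_TASK_COUNT", "1"))

chunk = my_chunk(1:100, task_id, ntasks)
for i in chunk
    # process item i here and write the result to a per-task output file
end
```

Each job is then an ordinary single-process Julia run, and the per-task results have to be collected afterwards, for instance from files.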

This may not be the best workflow. In particular, Distributed doesn’t offer a tree-based reduce, which might make mapreduce operations (i.e. distributed for loops) seriously slow compared to MPI. It’s really hard to compete with MPI if one needs to transfer large data across cores.

A lot of people use Slurm batch files to launch Julia processes, but I am a big fan of doing everything within Julia using ClusterManagers (which internally builds an srun command). My addprocs for example looks like

addprocs(SlurmManager(500), N=17, topology=:master_worker, exeflags="--project=.")

launching 500 worker processes across 17 nodes, after which I just use pmap to launch my long-running functions on each of the processes.
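One way to check where the workers actually ended up is to ask each one for its hostname. This is a minimal sketch that assumes two local workers as a stand-in for the Slurm-launched ones, so every worker reports the same machine here.

```julia
using Distributed

# Assumption: two local workers stand in for the workers Slurm would launch.
addprocs(2)

# Ask every worker which machine it is running on.
hosts = [remotecall_fetch(gethostname, w) for w in workers()]

# Count the workers per host.
per_host = Dict{String,Int}()
for h in hosts
    per_host[h] = get(per_host, h, 0) + 1
end
println(per_host)

rmprocs(workers())
```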

I see, I don’t know much about tree-based reduce. For my work, pmap returns an array of values from each of the worker processes and I do my own reduce operations on it.

What does tree-based reduce mean?
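Roughly speaking, a flat reduce has one process combine all partial results one after another, while a tree-based reduce combines them pairwise, level by level, so that n partial results need about log2(n) rounds when the pairs of each round are handled in parallel. The sketch below only illustrates the order of combination; it runs serially on one process, and `tree_reduce` is an illustrative name, not a function from Distributed.jl or MPI.jl.

```julia
# Flat reduce: a single process folds in the partial results one by one.
flat_reduce(op, parts) = foldl(op, parts)

# Tree reduce: combine neighbouring pairs, then pairs of pairs, and so on.
function tree_reduce(op, parts)
    isempty(parts) && throw(ArgumentError("need at least one partial result"))
    level = collect(parts)
    while length(level) > 1
        next = eltype(level)[]
        for i in 1:2:length(level)
            if i < length(level)
                # On a cluster, the pairs of one level could be combined in parallel.
                push!(next, op(level[i], level[i + 1]))
            else
                # An unpaired element moves up to the next level unchanged.
                push!(next, level[i])
            end
        end
        level = next
    end
    return level[1]
end

parts = [10, 20, 30, 40, 50]   # e.g. one partial sum per worker
println(flat_reduce(+, parts)) # 150
println(tree_reduce(+, parts)) # 150
```

For an associative operation both give the same result; the difference is that the flat version makes the collecting process the bottleneck when the partial results are large.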

Since we’re talking about use cases, I have to construct a 3D or 4D lattice of arbitrary side length and divide it into chunks to be processed by each worker. Each point is processed knowing its current value and the values of its nearest neighbors. In fact, in my implementation I don’t use pmap; I launch every process once and @sync them. I do this for a bunch of loops, making sure that the points each process is working on are totally independent and don’t require locking the array. The only exchange between processes happens when neighbors lie on a part of the lattice owned by another process, which DistributedArrays.jl handles automatically for me when I simply index the global array object.
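To illustrate what "independent points" can mean for a nearest-neighbor update, here is a serial sketch of one common scheme, a checkerboard sweep. It is an assumption about how independence might be arranged, not the code described above: the averaging update rule is made up, the lattice is periodic, and the side lengths must be even for the two colors to stay separate across the wrap-around.

```julia
# Sum of the 2N nearest neighbours of site I on an N-dimensional periodic lattice.
function neighbour_sum(A::AbstractArray{T,N}, I::CartesianIndex{N}) where {T,N}
    s = zero(T)
    for d in 1:N, shift in (-1, 1)
        # Move one step along dimension d, wrapping around at the edges.
        J = ntuple(k -> k == d ? mod1(I[k] + shift, size(A, k)) : I[k], N)
        s += A[J...]
    end
    return s
end

# Update all sites whose index sum has the given parity (0 or 1).
# Every neighbour of such a site has the other parity, so the updated sites
# never read each other and could be split among workers without locking.
function sweep_parity!(A, parity)
    for I in CartesianIndices(A)
        if mod(sum(Tuple(I)), 2) == parity
            A[I] = neighbour_sum(A, I) / (2 * ndims(A))  # hypothetical update rule
        end
    end
    return A
end

A = rand(4, 4, 4)     # even side lengths
sweep_parity!(A, 0)   # first colour
sweep_parity!(A, 1)   # second colour
```

In a distributed version each worker would run the inner update over its own chunk, and only sites on a chunk boundary would need values owned by another process.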