Getting started with distributed Julia computations on a cluster

Thus far, my parallel Julia experiences have involved running jobs on a single node on a cluster (shared memory environment) using the Distributed.jl module for computations like:

using SharedArrays

a = SharedArray{Float64}(10)
@distributed for i = 1:10
    a[i] = i
end

Which is to say, computations where there is, at most, a reduction operation amongst the workers.

If possible, I would like to begin taking advantage of multiple nodes on the cluster I have access to, but I’m having some trouble getting a sense of how to get started using Julia in a distributed memory environment. Does anyone have any suggestions on how to get started? If it’s at all helpful, the cluster I’m using runs the Univa Grid Engine.

1 Like

The distributed equivalent of a SharedArray is DistributedArray. This should do something like what you were doing:

julia> DArray((10,)) do I 
       a = Array{Float64}(undef, size.(I)...)
       for (ind, i) in enumerate(I[1])
           a[ind] = i
       end
       a
       end
10-element DArray{Float64,1,Array{Float64,1}}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0

Here I holds the indices of the local part on each worker.

Have a look into ClusterManagers.jl to see if it helps in launching jobs on your cluster.

For jobs that are embarrassingly parallel, look into pmap. This keeps track of free workers and submits jobs to them, ensuring that the nodes are evenly loaded.

I’ve found that parallel mapreduce can be sped up by using binary-tree based reductions locally on each node instead of the @distributed (op) for loops in Distributed. I had written a package to do this, although this isn’t widely tested. For comparison, on a Slurm cluster using 2 nodes with 28 cores on each I obtain

julia> @time @distributed (+) for i=1:nworkers()
           ones(10_000, 1_000)
       end;
 22.355047 seconds (7.05 M allocations: 8.451 GiB, 6.73% gc time)

julia> @time pmapsum(x -> ones(10_000, 1_000), 1:nworkers());
  2.672838 seconds (52.83 k allocations: 78.295 MiB, 0.53% gc time)
3 Likes