Running a function over multiple nodes on a julia cluster, from within a julia script

I have a function whose skeleton looks like this:

function inner(a,ij)
   return a*ij;
end

function outer(a,N)
   arr = zeros(ComplexF64, N);
   for ij = 1:N
        arr[ij] = inner(a,ij);
   end
   return sum(arr)
end

result = outer(1,1000);

I can run it on a slurm cluster, with n nodes and c cores per node. I want to parallelise the for loop inside the function “outer” such that each iteration of the loop runs in a different node, using all of its cores. For instance, iteration ij=1 runs on node 1 using its 16 cores, iteration ij = 2 runs on node 2 using its 16 cores, and so on.
What is the simplest way to achieve this?

Written this way, you can't: different Julia processes, running on different nodes, cannot simply access the same array, because they don't share memory. You would have to communicate the results from all processes back to one place and then, potentially, combine them.

If you want to use Distributed.jl and your example is representative of your actual code, you should be able to simply use pmap. It will take care of the communication and the merging for you.
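A minimal sketch of what that looks like (for illustration this adds 4 local workers; on a cluster you would add workers on the remote nodes instead):

```julia
using Distributed

# Sketch only: 4 local worker processes for illustration
addprocs(4)

# inner must be defined on every worker, hence @everywhere
@everywhere inner(a, ij) = a * ij

function outer(a, N)
    # pmap sends each iteration to a free worker and collects
    # the results back on the main process
    return sum(pmap(ij -> inner(a, ij), 1:N))
end

outer(1, 1000)  # → 500500, same as the serial loop
```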


In principle, this example does represent the actual code, although the value N is around 20, and the function inner is the expensive unit which I was hoping to parallelise over multiple nodes. I will look into pmap.

The function outer is called multiple times in the main program. Does it mean I would incur long waiting times in spawning processes across multiple nodes every time outer is called? Is it possible to have the nodes allocated at the beginning and simply access them via outer every time it’s called?

No.

Yes. You add the worker processes once, up front; every later call to outer then reuses them.
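For instance (a sketch; expensive and heavy are made-up names standing in for your actual functions), the spawning cost is paid once at startup:

```julia
using Distributed

# Pay the process-spawning cost once, at the top of the script ...
addprocs(2)

# made-up stand-in for the expensive unit of work
@everywhere expensive(ij) = (sleep(1); ij^2)

heavy(N) = sum(pmap(expensive, 1:N))

# ... then repeated calls reuse the same worker pool with
# no per-call spawning overhead.
heavy(4)
heavy(4)  # → 30 both times
```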

I am quite new to this. So, with the shell script below for Slurm, which requests 2 nodes, 1 task per node, and 16 cores per task (each node of my cluster has 16 cores with 2 threads each),

#!/bin/bash

#SBATCH -J m_node
#SBATCH -t 0-04:00:00
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task=16

srun /home/userdir/julia-1.10.4/bin/julia  /home/userdir/Work/julia_mnode.jl

I am looking for the correct way to initialise the processes in the julia script.

using Distributed

addprocs(32)

println("Number of processes: ", nprocs())
println("Number of workers: ", nworkers())


@everywhere function inner(a,ij)
   sleep(5);
   println("Inside inner")

   return a*ij;
end

function outer(a,N)
   tt0 = time()
   g(x) = ij -> inner(x, ij);
   arrsum = sum(pmap(g(a), (1:N)));
   tt1 = time()
   println("outer time = $(tt1-tt0)")

   return arrsum
end


function innerserial(a,ij)
   sleep(5);
   println("Inside innerserial")

   return a*ij;
end

function outerserial(a,N)
   arrsum = 0;
   tt0 = time()
   for ij in 1:N
        arrsum = arrsum + innerserial(a,ij);
   end
   tt1 = time()
   println("outerserial time = $(tt1-tt0)")

   return arrsum
end

println("outer = ",outer(1,5))
println("outerserial = ",outerserial(1,5))

(a) What I require:
If the loop were distributed over 2 nodes (1 node evaluating one iteration using all of its 16 cores), then the time should have been around 15s. First, the two nodes evaluate inner once each, then the two nodes repeat it, and finally one node evaluates it once, totalling 5+5+5=15s. In this scenario, each evaluation of inner uses all 16 cores in each node.

(b) What I am seeing:
Instead, it is getting distributed over all cores, hence it is finishing in about 5s. This means each evaluation of inner is only getting 1 core. Also, everything is evaluated twice, almost as if both nodes are repeating the same thing.

Number of processes: 33
Number of workers: 32
Number of processes: 33
Number of workers: 32
      From worker 11:   Inside inner
      From worker 4:    Inside inner
      From worker 23:   Inside inner
      From worker 13:   Inside inner
      From worker 27:   Inside inner
      From worker 15:   Inside inner
      From worker 24:   Inside inner
outer time = 6.385792970657349
outer = 15
      From worker 21:   Inside inner
      From worker 14:   Inside inner
      From worker 16:   Inside inner
outer time = 6.389128923416138
outer = 15
Inside innerserial
Inside innerserial
Inside innerserial
Inside innerserial
Inside innerserial
Inside innerserial
Inside innerserial
Inside innerserial
Inside innerserial
outerserial time = 25.03065299987793
outerserial = 15
Inside innerserial
outerserial time = 25.02852702140808
outerserial = 15
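For what it's worth, both symptoms in (b) have a likely explanation: srun with --nodes 2 and --ntasks-per-node 1 launches the whole Julia script once per node (hence every line of output appearing twice), and addprocs(32) only spawns 32 workers on the node it runs on, each a plain single-core process. One way around this, assuming the SlurmClusterManager.jl package is available, is to start julia once (without srun) inside the allocation and let it spawn one worker per Slurm task:

```julia
using Distributed
using SlurmClusterManager  # assumption: this package is installed

# Spawns one worker per task of the current Slurm allocation,
# e.g. with #SBATCH --ntasks=2 --cpus-per-task=16 in the batch
# script, and the script started with plain `julia`, not `srun`.
addprocs(SlurmManager())

@everywhere function inner(a, ij)
    sleep(5)
    return a * ij
end

outer(a, N) = sum(pmap(ij -> inner(a, ij), 1:N))
```

Note that inner itself is still single-threaded here; for each evaluation to use all 16 cores of its node, inner would additionally need internal multithreading (e.g. Threads.@threads) and the workers would need to be started with multiple threads.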