I also have a very similar setup to yours, and I don't use sbatch or the bash script. I would recommend using ClusterManagers.jl instead; it is a far easier solution.
Suppose you have a function do_large_computation() that you'd like to parallelize across nodes/CPUs. You can set up your script like the following:
using Distributed, ClusterManagers
addprocs(SlurmManager(500), N=17, topology=:master_worker, exeflags="--project=.")
This adds 500 worker processes across 17 nodes (I have 32 cores per node). You can ignore the topology keyword for now; exeflags can be used to pass command-line flags to each worker's julia process (in my case, I am activating the current project environment on each worker).
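As an aside, keyword arguments that addprocs itself does not recognise are forwarded by SlurmManager to srun (that is how N=17 turns into srun's -N 17). A minimal sketch, assuming a hypothetical partition named "compute" and a one-hour time limit (substitute your cluster's actual values):

using Distributed, ClusterManagers
# Unrecognised keyword arguments become srun flags: single-letter names map to
# short flags (N=2 -> `-N 2`, t=... -> `-t ...`), longer names to --long flags.
addprocs(SlurmManager(64), N=2,
         partition="compute",   # placeholder partition name
         t="01:00:00",          # placeholder walltime
         topology=:master_worker, exeflags="--project=.")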
Now you can run your code as if you had done addprocs()
locally. So for example, you can do something like
@everywhere include("file.jl") # where file.jl includes your do_large_computation() function
# or
@everywhere using PkgA # if you'd like to load a package on the worker instances
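For completeness, here is a hypothetical sketch of what file.jl could contain. The workload itself is just a placeholder; the point is that everything the workers need (the function and any packages it uses) gets defined on each process when the file is included with @everywhere:

# file.jl -- hypothetical contents, loaded on every worker via @everywhere above
using Statistics   # any package the computation needs can be loaded here

function do_large_computation(i)
    # placeholder workload: mean of a large random sample, offset by the
    # task index so each call returns something distinct
    return mean(randn(10_000)) + i
end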
Then to run the function in a parallel manner, I simply use pmap, i.e.,
pmap(x -> do_large_computation(x), 1:nsims)
which launches and manages your function nsims times across the nodes. The results are all collected into an array and passed back to the head node (or the node from which pmap was executed).
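Putting it all together, the end of the driver script might look like the sketch below (nsims and its value are placeholders; rmprocs is the standard Distributed call to shut the workers down and release the Slurm allocation when you are finished):

nsims = 1000                                     # placeholder number of simulations
results = pmap(x -> do_large_computation(x), 1:nsims)
# results is a Vector with one entry per task, gathered on the calling node
rmprocs(workers())                               # release the workers when done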
Let me know if you have other questions. It's also a great exercise to look at how ClusterManagers sets up the srun command internally, which gives a better understanding of what's happening under the hood.