I am using a computer cluster to parallelize the execution of my Julia code.
The partition I am using has a total of 14 nodes, and my goal is to connect to this partition and execute my main.jl file, which specifies the following lines at the beginning:
using Distributed
using ClusterManagers
addprocs(SlurmManager(14))
My question is: how should I configure the Slurm script to ensure the code parallelizes correctly? Currently, my Slurm script contains the following lines:
#SBATCH --tasks=1
#SBATCH --nodes=14
I understand that the number of tasks should be 1, as I am only submitting the job to execute my main.jl file, but setting --nodes=14 generates an error indicating that more processors were requested than available. I believe there might be a configuration with #SBATCH --cpus-per-task=M, but I am unsure if this applies in my case.
This doesn’t necessarily answer your question, but consider checking out SlurmClusterManager.jl, which leaves the resource allocation entirely to Slurm. Calling addprocs inside Julia without additional arguments then just adds workers according to your Slurm allocation. I found it much more straightforward that way…
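For what it's worth, here is a minimal sketch of that approach. It assumes SlurmClusterManager.jl is installed and that the script is started inside an existing Slurm allocation (for example from an sbatch script that requests --ntasks=14); the println check is only there for illustration.

```julia
using Distributed, SlurmClusterManager

# No worker count is passed: SlurmManager() picks up the current Slurm
# allocation and starts workers for the tasks that were allocated.
addprocs(SlurmManager())

# Quick check that the workers landed on the expected nodes.
@everywhere println("hello from worker $(myid()) on $(gethostname())")
```

With this setup the resource request lives only in the batch script, so the Julia code does not have to repeat the number of workers.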
You do not need a batch file if you are using ClusterManagers. When you run addprocs(SlurmManager(14)), it runs an srun command internally to allocate available resources. If you need finer control over the allocation (like --nodes), you can pass additional keyword arguments to addprocs alongside the SlurmManager; see the documentation. Once you have run addprocs(SlurmManager(14)), you can verify that Slurm has allocated resources by running sinfo or squeue in the terminal.
From Julia’s point of view, you now have N = 14 distributed workers. The easiest way to parallelize your code is to use pmap. Your code can look something like the following:
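To illustrate the extra arguments mentioned above, here is a small sketch modeled on the example in the ClusterManagers.jl README. The partition name "debug" and the five-minute time limit are placeholders. As far as I understand, the keywords are forwarded to srun as command-line options (partition becomes --partition, t becomes -t), so check them against what your cluster accepts.

```julia
using Distributed
using ClusterManagers

# Placeholders: replace "debug" and the time limit with values valid on your cluster.
addprocs(SlurmManager(14), partition="debug", t="00:05:00")

# Confirm that the workers started and see which host each one runs on.
for w in workers()
    host = remotecall_fetch(gethostname, w)
    println("worker $w is running on $host")
end
```

If that understanding is right, there is no separate list of keywords to look up: the options documented for srun are the ones to consult.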
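using Distributed  # provides pmap

function long_running_simulation(simid)
    sleep(5)  # stand-in for the actual work
    println("running $simid on host $(gethostname())")
    return simid  # placeholder: return the simulation's actual result here
end

function run_simulation(n_sims)
    # pmap hands each simulation id to the next free worker
    results = pmap(1:n_sims) do x
        long_running_simulation(x)
    end
    return results
end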
This will run independent copies of long_running_simulation on the workers as they become available. For example, if n_sims = 100, it will initially run 14 copies of long_running_simulation, and as each one finishes and a worker becomes free, it will start the next one, and so on. The variable results is an array of the return values of long_running_simulation, in the same order as the inputs.
Note: There is some additional work to be done, like using @everywhere to make sure long_running_simulation is actually defined on the worker processes.
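As a rough sketch of what that additional work looks like, the following version can be tried on a single machine. It assumes two local workers as a stand-in for the Slurm workers, and the sleep duration and squared return value are placeholders for the real simulation.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm workers here;
# on the cluster this line would be addprocs(SlurmManager(14)).
addprocs(2)

# @everywhere defines the function on the master and on every worker,
# so pmap can call it remotely.
@everywhere function long_running_simulation(simid)
    sleep(0.1)  # placeholder for the real work
    println("running $simid on host $(gethostname()) (worker $(myid()))")
    return simid^2  # placeholder result
end

n_sims = 6
# pmap returns the results in the same order as the inputs.
results = pmap(long_running_simulation, 1:n_sims)

# Shut the workers down once the work is finished.
rmprocs(workers())
```

Note that addprocs has to come before @everywhere, since the macro only reaches processes that already exist. Any packages the function depends on need the same treatment, for example @everywhere using SomePackage, where SomePackage is a hypothetical dependency.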
Thanks @affans for your answer. I had no idea that when using ClusterManagers.jl, it’s not necessary to manually create a “.slurm” file. On the other hand, my parallelization specifically involves using the pmap() function and @everywhere to define my parallelizable function on all workers.
However, I still have one question, and I would appreciate it if you could help me.
Since I’m not creating a Slurm file, I understand that arguments like the partition name, the maximum execution time, etc. are passed as kwargs to addprocs together with SlurmManager(14). Do you know where I can find a list of the kwargs that this function accepts?