The parallel part of my Julia script uses the Distributed.jl package and the remotecall_fetch function. (I am not using MPI.) In short, it solves many independent linear programming problems in parallel. (Here is the sample code.) On my HPC cluster, each node has 40 CPUs with 2 hyper-threads per CPU. I was able to parallelize my code on one node with 40*2 = 80 threads. Now I need to solve many more problems, so I want to parallelize across multiple nodes and more threads. How should I do this?
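For reference, here is a minimal sketch of the pattern I'm using (not my actual code; the JuMP/Gurobi model is a toy stand-in):

using Distributed
addprocs(80)                     # one worker per hyper-thread on a single node

@everywhere using JuMP, Gurobi

# Solve one small LP instance for a given cost vector c.
@everywhere function solve_lp(c)
    model = Model(Gurobi.Optimizer)
    set_silent(model)
    @variable(model, x[1:length(c)] >= 0)
    @constraint(model, sum(x) >= 1)
    @objective(model, Min, c' * x)
    optimize!(model)
    return objective_value(model)
end

inputs = [rand(10) for _ in 1:1000]

# asyncmap keeps up to nworkers() remote calls in flight, so the workers
# run concurrently even though each remotecall_fetch blocks.
results = asyncmap(eachindex(inputs); ntasks = nworkers()) do i
    remotecall_fetch(solve_lp, workers()[mod1(i, nworkers())], inputs[i])
end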
I saw this post, but couldn’t replicate what people suggested there. Let’s say I need 80 CPUs from 2 nodes.
First option: as suggested by @ChrisRackauckas in his blog post, I use --machine-file and write the Slurm script as follows:
#!/bin/bash
#SBATCH --job-name=xz49
#SBATCH --partition=interactive
#SBATCH --nodes=2
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=40
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00:30:00
julia --machine-file $SLURM_JOB_NODELIST gurobi.jl
Then I got the following error:
ERROR: SystemError: opening file "/home/xz49/bc4u11n[1-2]": No such file or directory
In the example in his blog, he has
export SLURM_NODEFILE=`generate_pbs_nodefile`
./julia --machine-file $SLURM_NODEFILE /home/crackauc/test.jl
But I couldn’t find an equivalent way to generate the node file under Slurm (generate_pbs_nodefile is for PBS). Also, how should I utilize the hyper-threads in this case?
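After some digging, my best guess is that scontrol can expand the compressed node list into a proper machine file, something like this (untested end-to-end; the 40* prefix is Julia's machine-file syntax for starting 40 workers on a host, and presumably 80* would use the hyper-threads):

# Expand bc4u11n[1-2] into one hostname per line, prefixing each with a
# worker count so Julia starts 40 workers per node:
scontrol show hostnames "$SLURM_JOB_NODELIST" | awk '{print "40*"$0}' > slurm_nodefile
julia --machine-file slurm_nodefile gurobi.jl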
Second option: use the ClusterManagers.jl package. My understanding is that I do not need to write a Slurm script anymore. So, at the top of my Julia script, I add:
using ClusterManagers
addprocs_slurm(80, partition="interactive", time="00:30:00", mem_per_cpu="4G")
Then I type julia test.jl in the shell to launch my Julia script, and I get the following error:
Error launching Slurm job:
ERROR: LoadError: TaskFailedException:
MethodError: no method matching replace(::String, ::String, ::String)
It seems that this is due to an unresolved bug in this package (the MethodError suggests it still calls the pre-1.0 replace(s, old, new) signature). Is there a way to circumvent this bug? Also, how should I utilize the hyper-threads in this case?
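In case it helps, this is what I expect the working call to look like once that bug is resolved (my understanding, unverified, is that addprocs_slurm forwards keyword arguments to srun with underscores turned into dashes):

using Distributed, ClusterManagers

# Request 80 workers from Slurm: 2 nodes * 40 tasks per node.
addprocs_slurm(80;
               partition = "interactive",
               nodes = "2",
               ntasks_per_node = "40",
               time = "00:30:00",
               mem_per_cpu = "4G")

@everywhere using JuMP, Gurobi   # load the solver code on every worker

# ... then dispatch the LPs with remotecall_fetch / pmap as before ...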
Third option: I guess this is wrong, but I have tried the following Slurm script:
#SBATCH --nodes=2
#SBATCH --ntasks=80
But it seems that this runs the Julia script 80 times in parallel rather than parallelizing it across 80 cores.
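If I understand correctly, srun starts one copy of the command per task, which is why the script runs 80 times. My guess (unverified) is that Julia should be started only once inside the allocation and left to spawn its own workers, e.g. via the ClusterManagers route above:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --partition=interactive
#SBATCH --time=00:30:00

# Start Julia once (no srun here); test.jl then calls addprocs_slurm(80),
# which launches the workers inside this existing allocation.
julia test.jl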
Any suggestions will be greatly appreciated.