I know I’m a little late to this thread, but I had the same problem, so I hope this helps others out. My cluster uses PBS, but no ssh communication is allowed between nodes. My Julia code uses DistributedArrays and @sync/@async blocks, and I didn’t want to modify it too much, so I did the following, as suggested by @barche.
using Distributed, MyModules, MPI
# serial part of the code
# options for the parallel run (e.g. nworkers) are set here
# parallel start
mgr = MPI.start_main_loop(MPI.MPI_TRANSPORT_ALL)
addprocs(nworkers)   # nworkers comes from the serial section above
@info "workers are $(workers())"
@everywhere any(pwd() .== LOAD_PATH) || push!(LOAD_PATH, pwd())
@everywhere using Distributed, MyModules
# parallel code here, e.g. MyModules.foo(options, data)
rmprocs(workers())
MPI.stop_main_loop(mgr)
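For context, the parallel section is plain Distributed/DistributedArrays code with @sync/@async, nothing MPI-specific. A rough sketch of the pattern (the array, the doubling loop and the names here are only illustrative, not my actual MyModules.foo):
using Distributed, DistributedArrays
d = distribute(collect(1.0:1000.0))   # spread an array over the workers
@sync for p in workers()              # wait for all the @async tasks below
    @async remotecall_wait(p) do
        lp = localpart(d)             # this worker's chunk of the DArray
        lp .= 2 .* lp
    end
end
result = Array(d)                     # gather the result back on the master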
Crucially, I had to run the code as
mpirun -np 1 julia MyCode.jl
Here’s the whole PBS script requesting 32 workers:
#PBS -P blah
#PBS -q myque
#PBS -l ncpus=32
#PBS -l mem=256GB
#PBS -l walltime=00:15:00
#PBS -l wd
#PBS -N testJulia
#PBS -o grid.out
#PBS -e grid.err
#PBS -j oe
ulimit -s unlimited
ulimit -c unlimited
module load gcc/5.2.0 openmpi/3.0.1 julia/1.1.1
mpirun -np 1 julia ./MyCode.jl > outfile.run
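I submit it with qsub in the usual way, e.g. qsub testJulia.pbs (the filename is just whatever you save the script as).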