I am trying to run my Julia code on multiple nodes of a cluster that uses Moab as the scheduler and Torque as the resource manager. In an interactive session, I run the following:
qsub -A open -l walltime=1:00:00 -l nodes=4:ppn=3 -I
module purge
module use /gpfs/group/dml129/default/sw/modules
module load openmpi
module load julia
export UCX_TLS=all
mpirun -np 12 --hostfile $PBS_NODEFILE -x PATH -x LD_LIBRARY_PATH -display-allocation julia --project=. "estimation/test/test.jl" TCP
However, I get the following error:
--------------------------------------------------------------------------
ERROR: ERROR: ERROR: LoadError: LoadError: LoadError: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:IOError: connect: connection refused (ECONNREFUSED)
It seems each node is returning an ECONNREFUSED error.
What could be done to fix this issue?
Just in case it helps, here is my test Julia code, which uses MPIClusterManagers.jl:
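using MPIClusterManagers, Distributed
import MPI
# Rank 0 continues as the Distributed master; the other MPI ranks become workers (TCP transport).
manager = MPIClusterManagers.start_main_loop(TCP_TRANSPORT_ALL)
# Define the test function on every process.
@everywhere function plusone(x)
    sleep(10)
    return x + 1
end
# Distribute six 10-second tasks across the workers and time the run.
@time pmap(plusone, 1:6)
MPIClusterManagers.stop_main_loop(manager)
exit()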
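In case it helps with diagnosis, here is a small sketch for confirming that the allocation matches the `-np 12` passed to `mpirun`. It relies on Torque writing one hostname per slot to `$PBS_NODEFILE`, so `nodes=4:ppn=3` should produce four hosts with three entries each. The helper name `summarize_nodefile` is hypothetical and was not part of the run above.

```shell
# Hypothetical helper: summarize a Torque node file
# (one hostname per line, repeated once per allocated slot).
summarize_nodefile() {
    nodefile="$1"
    # Count how many slots each host contributes, ignoring blank lines.
    grep . "$nodefile" | sort | uniq -c | awk '{print $2 ": " $1 " slots"}'
    # Total number of slots; this should equal the value passed to -np.
    echo "total: $(grep -c . "$nodefile") slots"
}

# Inside the interactive job, one would call:
# summarize_nodefile "$PBS_NODEFILE"
```

If the total is not 12, or the hosts differ from what `-display-allocation` reports, the mismatch would be worth ruling out before looking at the TCP connection itself.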