Help setting up Julia on a cluster

I have not tried it on 1.0, so can’t comment on that, but the functions you linked to shouldn’t be necessary to use unless you are writing your own cluster manager. Have you tried:

  1. Putting addprocs_pbs(16) at the top of your Julia script, the same way you would use addprocs() in a script running on a local machine?

or, if that doesn’t work,

  2. Calling Julia using the --machinefile option in your PBS script, similar to my example above?
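As a rough sketch of what the second option amounts to when done inside the script: the machine-file flag makes Julia read a list of hosts and start SSH workers on them, which can be approximated with addprocs on a vector of hostnames. This assumes the scheduler exports PBS_NODEFILE and that passwordless SSH between nodes works; pbs_hosts is a hypothetical helper name, not part of Distributed or ClusterManagers.

```julia
using Distributed

# Hypothetical helper: read one hostname per line from a PBS node file,
# skipping blank lines. A host listed k times yields k entries.
function pbs_hosts(nodefile::AbstractString)
    return [String(strip(l)) for l in readlines(nodefile) if !isempty(strip(l))]
end

# Assumption: PBS sets PBS_NODEFILE inside a running job.
nodefile = get(ENV, "PBS_NODEFILE", "")
if !isempty(nodefile) && isfile(nodefile)
    hosts = pbs_hosts(nodefile)
    addprocs(hosts)   # starts one SSH worker per entry in the list
end

println("Number of processes: ", nprocs())
```

Outside a PBS job the variable is unset, so the script simply reports a single process.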

The --machinefile option does not work in 1.0. I’ll try addprocs_pbs().

I am unable to add ClusterManagers due to some issues with my cluster account, so I was looking for a native solution.

Thanks!

I think --machinefile was replaced by --machine-file in Julia v0.7+.

Hello,

Sorry to revive this thread, but I too am trying to run some Julia code on a remote cluster using PBS. I tried the test script test_julia.jl that @ElOceanografo posted, with the update from @juliohm, i.e.:

using Distributed
using ClusterManagers 
addprocs_pbs(15)

println("Hello from Julia")
np = nprocs()
println("Number of processes: $np")

for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    println("Hello from process $(pid) on host $(host)!")
end

tasks = randn(np * 30)

@everywhere begin
    function foo(x)
        return x * 4
    end
end

results = pmap(foo, tasks)

println(results)

for i in workers()
    rmprocs(i)
end

My submission script looks like this:

#!/bin/sh  
#PBS -N test_parallel
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=16
#PBS -j oe 

cd $PBS_O_WORKDIR

julia test.jl 

Unfortunately, this returns the following output, including the errors:

┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:928
Error launching workers
MethodError(iterate, (Base.ProcessChain(Base.Process[Process(`echo 'cd /home/puck/WORK/testParallel && /usr/bin/julia --worker=c7kM2ZZEVlrMc45M'`, ProcessRunning), Process(`qsub -N julia-316294 -j oe -k o -t 1-15`, ProcessRunning)], Base.DevNull(), Base.PipeEndpoint(RawFD(0x00000011) open, 0 bytes waiting), Base.DevNull()),), 0x00000000000063e9)
Hello from Julia
Number of processes: 1
Hello from process 316294 on host planck!
[-0.838315, 1.15875, -1.13005, -2.51021, 0.299758, -1.20761, -2.66802, 0.591652, 1.14451, -6.3455, -6.15408, 1.97041, -0.74972, -3.17471, 7.71404, 1.37577, -3.61361, 2.89938, 2.10592, 4.70652, -1.72959, -2.48799, -2.66151, -0.136183, -1.27427, -4.37823, -1.17756, 6.24257, -5.0602, -4.91916]

I’m new to HPC services as well as Julia, so I’m afraid I’m quite baffled by what’s going on. It looks like it’s only launching one worker, though I’m using a machine with 32 CPUs (1 socket, 16 cores per socket, 2 threads per core).

Could anyone point out the obvious to me here?
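One thing the output above does show: when nprocs() is 1, workers() returns [1], so both loops ran on the master process and the call rmprocs(1) produced the "process 1 not removed" warning. A minimal sketch of a guard for that case, assuming only the Distributed standard library (report_workers is a hypothetical helper name):

```julia
using Distributed

# Hypothetical helper: print where each worker runs, or say that none exist.
function report_workers()
    if nprocs() == 1
        println("No workers were added; everything runs on the master process.")
        return false
    end
    for i in workers()
        host, pid = fetch(@spawnat i (gethostname(), getpid()))
        println("Worker $i: process $pid on host $host")
    end
    return true
end

have_workers = report_workers()

@everywhere foo(x) = x * 4
results = pmap(foo, randn(4))   # still works without workers, just not in parallel

# Remove only real workers; rmprocs(1) is what emits the warning.
have_workers && rmprocs(workers())
```

This does not fix the failed worker launch itself; it only makes the failure visible instead of letting the script continue silently on one process.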

Have you tried starting Julia with the --machine-file option? I haven’t done any HPC in a while, but on the cluster I was using I couldn’t get the cluster to launch processes via addprocs_pbs.

Unfortunately, I tried it and got the following error message:

PBS: node file is /var/spool/torque/aux//230.planck.localhost
Host key verification failed.
ERROR: Unable to read host:port string from worker. Launch command exited with error?
read_worker_host_port(::Base.PipeEndpoint) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:275
connect(::Distributed.SSHManager, ::Int64, ::WorkerConfig) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:397
create_worker(::Distributed.SSHManager, ::WorkerConfig) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:501
setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:447
(::getfield(Distributed, Symbol("##47#50")){Distributed.SSHManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] macro expansion at ./task.jl:245 [inlined]
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:401
 [3] #addprocs_locked at ./none:0 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::Function, ::Distributed.SSHManager) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:365
 [5] #addprocs at ./none:0 [inlined]
 [6] #addprocs#249(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Array{Any,1}) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:118
 [7] addprocs(::Array{Any,1}) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/managers.jl:117
 [8] process_opts(::Base.JLOptions) at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.1/Distributed/src/cluster.jl:1225
finished

I’m not entirely sure what’s going on here. I guess the key is "Host key verification failed." Any ideas why this might be?

Yeah, that looks like an issue with SSH on your cluster rather than Julia itself. @juliohm was running into something similar a couple years ago when this thread started, I don’t know if he ever got it resolved? At any rate, I’d recommend asking your cluster admin about it at this point.


Or, you could try this to get around an SSH restriction:


Hi guys,

I am pretty new to multiprocessing on a PBS cluster.

I’m wondering whether I can use the ncpus=64 option in my PBS job script and then call Distributed.addprocs. My understanding is that this runs all the processes on the same node instead of submitting a whole new job via qsub, so that should be alright?

Thanks.
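A minimal sketch of that single-node approach, under some assumptions: calling addprocs with an integer starts local workers on the machine where Julia is already running, so no extra qsub or SSH is involved. The environment variable names NCPUS and PBS_NUM_PPN are assumptions about what the scheduler exports (they vary between PBS flavours), and worker_count is a hypothetical helper.

```julia
using Distributed

# Hypothetical helper: pick a worker count from scheduler variables if present,
# leaving one core for the master process. Falls back to the local CPU count.
function worker_count(env = ENV)
    for key in ("NCPUS", "PBS_NUM_PPN")   # assumed variable names
        haskey(env, key) && return max(parse(Int, env[key]) - 1, 1)
    end
    return max(Sys.CPU_THREADS - 1, 1)
end

addprocs(worker_count())   # local workers on this node only

@everywhere foo(x) = x * 4
results = pmap(foo, randn(8))
println("Workers: ", nworkers())

rmprocs(workers())
```

Leaving one core for the master mirrors the earlier script, which requested 16 cores and added 15 workers; whether that is the right split depends on how much work the master does.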