Hello,
Until recently, I was able to run multi-core jobs on my university’s HPC SLURM cluster like this:
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mymail@mailymail.country
#SBATCH --export=ALL
#SBATCH -J a_julia_job
#SBATCH --partition=xeon16
#SBATCH -n 64
#SBATCH -N 4
#SBATCH --mem=50G
#SBATCH --time=5-00:00:00
#SBATCH --output=log_file_good_info.out
srun hostname | sort > nodefile.$SLURM_JOBID
julia --machine-file nodefile.$SLURM_JOBID /the/path/to/the/script/thascript.jl inputs_file.jl
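For reference, the nodefile written by the srun line contains one hostname per task (so with -n 64 across -N 4 nodes, 16 repeats per node). As far as I know, julia --machine-file also accepts a collapsed count*host form. A minimal sketch of producing that form (with a fabricated nodefile standing in for nodefile.$SLURM_JOBID):

```shell
# Fabricated stand-in for nodefile.$SLURM_JOBID: 3 tasks on node01, 2 on node02.
NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode01\nnode02\nnode02\n' > "$NODEFILE"

# Collapse repeated hostnames into Julia's "count*host" machine-file syntax:
# "3*node01" means launch 3 workers on node01.
sort "$NODEFILE" | uniq -c | awk '{print $1 "*" $2}'
```

This is only a convenience; the repeated-hostname file the srun line produces should work just as well, so I don't think the file format itself is my problem.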
But now I get an authentication error after a couple of seconds, even when I request only my main computational node (instead of 4 in total, for example):
Permission denied, please try again.^M
Permission denied, please try again.^M
Received disconnect from 12.3.456.7 port 22:2: Too many authentication failures^M
Authentication failed.^M
ERROR: TaskFailedException:
Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
[1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1059
[2] worker_from_id at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1056 [inlined]
[3] #remote_do#156 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
[4] remote_do at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
[5] kill at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:534 [inlined]
[6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:581
[7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:523
[8] (::Distributed.var"#43#46"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:333
Stacktrace:
[1] sync_end(::Array{Any,1}) at ./task.jl:300
[2] macro expansion at ./task.jl:319 [inlined]
[3] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::typeof(Distributed.addprocs_locked), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:477
[4] #addprocs_locked at ./none:0 [inlined]
[5] #addprocs#39(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::typeof(addprocs), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
[6] #addprocs at ./none:0 [inlined]
[7] #addprocs#243(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(addprocs), ::Array{Any,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:118
[8] addprocs(::Array{Any,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:117
[9] process_opts(::Base.JLOptions) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1305
[10] #invokelatest#1 at ./essentials.jl:709 [inlined]
[11] invokelatest at ./essentials.jl:708 [inlined]
[12] exec_options(::Base.JLOptions) at ./client.jl:254
[13] _start() at ./client.jl:460
I believe I ran a Pkg.update() shortly before this started. Do you think this error was caused by the Julia update, by a cluster security update of some kind, or by a mistake of my own? I have double-checked that I haven’t changed anything on my end, but of course I may still have overlooked something.
I am aware that the Julia FAQ recommends a different way of running Julia scripts with command-line options from a #!/bin/bash script. Maybe that is what I should do? However, the approach above has worked perfectly fine on my HPC cluster up until this week.
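For completeness, the trick the FAQ describes (as I understand it) is a bash/Julia polyglot header: bash executes the `exec` line inside the `#= ... =#` block and re-invokes julia on the same file with the desired flags, while Julia skips that block as a multi-line comment. A minimal sketch, writing the hypothetical wrapper to a temp file purely to show its shape:

```shell
# Sketch of the FAQ-style polyglot script (my understanding, not my current setup).
# bash runs the `exec julia ...` line; Julia treats #= ... =# as a comment.
wrapper=$(mktemp)
cat > "$wrapper" <<'WRAP'
#!/bin/bash
#=
exec julia --startup-file=no "${BASH_SOURCE[0]}" "$@"
=#
println("ARGS = ", ARGS)
WRAP
chmod +x "$wrapper"
head -n 5 "$wrapper"   # show the polyglot header
```

With such a header one would run the .jl file directly (e.g. ./thascript.jl inputs_file.jl), but I don't see how that would interact with the SSH authentication failure, which is why I suspect something else changed.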
Grateful for any thoughts or comments! Apologies if this question belongs somewhere else (e.g. Stack Overflow); if so, please redirect me.