HPC cluster SSH security issue? (password vs passwordless ssh keypair)

Hello,

Until recently, I was able to run multi-core jobs on my university’s HPC SLURM cluster like this:

#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mymail@mailymail.country
#SBATCH --export=ALL
#SBATCH -J a_julia_job
#SBATCH --partition=xeon16
#SBATCH -n 64
#SBATCH -N 4
#SBATCH --mem=50G
#SBATCH --time=5-00:00:00
#SBATCH --output=log_file_good_info.out
srun hostname | sort > nodefile.$SLURM_JOBID
julia --machine-file nodefile.$SLURM_JOBID /the/path/to/the/script/thascript.jl inputs_file.jl

But now, I get an authentication error message after a couple of seconds, even when I try to use only my main computational node (instead of 4 in total, for example):

Permission denied, please try again.^M
Permission denied, please try again.^M
Received disconnect from 12.3.456.7 port 22:2: Too many authentication failures^M
Authentication failed.^M
ERROR: TaskFailedException:
Unable to read host:port string from worker. Launch command exited with error?
Stacktrace:
[1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1059
[2] worker_from_id at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1056 [inlined]
[3] #remote_do#156 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
[4] remote_do at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:482 [inlined]
[5] kill at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:534 [inlined]
[6] create_worker(::Distributed.SSHManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:581
[7] setup_launched_worker(::Distributed.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:523
[8] (::Distributed.var"#43#46"{Distributed.SSHManager,Array{Int64,1},WorkerConfig})() at ./task.jl:333
Stacktrace:
[1] sync_end(::Array{Any,1}) at ./task.jl:300
[2] macro expansion at ./task.jl:319 [inlined]
[3] #addprocs_locked#40(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::typeof(Distributed.addprocs_locked), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:477
[4] #addprocs_locked at ./none:0 [inlined]
[5] #addprocs#39(::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:tunnel, :sshflags, :max_parallel),Tuple{Bool,Cmd,Int64}}}, ::typeof(addprocs), ::Distributed.SSHManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:441
[6] #addprocs at ./none:0 [inlined]
[7] #addprocs#243(::Bool, ::Cmd, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(addprocs), ::Array{Any,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:118
[8] addprocs(::Array{Any,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/managers.jl:117
[9] process_opts(::Base.JLOptions) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/cluster.jl:1305
[10] #invokelatest#1 at ./essentials.jl:709 [inlined]
[11] invokelatest at ./essentials.jl:708 [inlined]
[12] exec_options(::Base.JLOptions) at ./client.jl:254
[13] _start() at ./client.jl:460

I believe I ran a Pkg.update() shortly before this. Do you think this error was caused by the Julia package update, by a cluster security update of some kind, or by a mistake of my own? I have double-checked that I haven’t made any errors, but of course I may still have overlooked something.

I am aware that the Julia FAQ recommends a different way of running Julia scripts with options from a #!/bin/bash script. Maybe that is what I should do? However, the approach above has worked perfectly fine on my HPC cluster until this week.

Grateful for any thoughts or comments! Sorry if this question belongs somewhere else (e.g. Stack Overflow); if so, please redirect me.

Looks like a security update to me.
Slurm clusters use something called munge for authentication between the Slurm daemons. It could be that the cluster admins have disabled direct SSH access to the compute nodes.

To check, I would write a quick job script which produces that nodefile, then write a bash loop which runs through the list and does
ssh $host date
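
Something like this minimal sketch, assuming a short test allocation is allowed on your partition (adjust the #SBATCH lines for your cluster; BatchMode=yes makes ssh fail immediately instead of hanging on a password prompt):

#!/bin/bash
#SBATCH -N 4
#SBATCH -n 4
#SBATCH --time=00:05:00
#SBATCH --output=ssh_check.out
# one hostname per allocated node
srun hostname | sort -u > nodefile.$SLURM_JOBID
# try a non-interactive command on every node in the list
while read -r host; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" date \
        && echo "ssh to $host: OK" \
        || echo "ssh to $host: FAILED"
done < nodefile.$SLURM_JOBID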

Or just run a two-node interactive Slurm job and try to ssh between the compute nodes.
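
On clusters that allow interactive jobs, that check would look roughly like this (resource values are only examples):

salloc -N 2 -n 2 --time=00:10:00
srun hostname                  # prints the nodes you were allocated
ssh <one-of-those-nodes> date  # then try to reach one of them by hand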

I should say I am basing my response on those first four lines of the error, but it is something you should be able to check quickly.

Thank you. Unfortunately, interactive Slurm jobs are disabled on my university cluster. But I am starting to think this is a cluster security issue for sure.

Update: I should add that I am investigating whether it might have something to do with my recently creating an SSH keypair with a passphrase, rather than using the standard passphrase-less SSH keypair. Does anyone know whether Julia inspects an SSH key, checks whether a passphrase is required, and then tries to connect accordingly?

You should be able to use an ssh-agent in the job, though. Actually, I don’t have experience with that, as I would always use a passwordless pair on a cluster.

I guess the other solution would be a passwordless pair and some sort of .ssh/config which says “only use this passwordless pair when connecting to cluster nodes”.
I think this is quite easy.

I am going to be shot for this advice by security mavens, but the .ssh/config could read:

Host node*
    IdentityFile ~/.ssh/my-key-with-no-password

I think configuring ssh-agent in the job may be better: keep the passphrase for the key in a file in your home directory, and don’t expose it in the job script. Something like the sketch below.
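
A rough sketch of what I have in mind, inside the job script. The key name and the passphrase file are just examples, and the SSH_ASKPASS helper is the classic trick for feeding ssh-add a passphrase non-interactively (newer OpenSSH also understands SSH_ASKPASS_REQUIRE=force):

# start a private agent for this job
eval "$(ssh-agent -s)"
# ssh-add has no passphrase flag, so hand the passphrase over via SSH_ASKPASS
cat > "$HOME/.ssh/askpass.sh" <<'EOS'
#!/bin/sh
cat "$HOME/.ssh/key_passphrase"
EOS
chmod 700 "$HOME/.ssh/askpass.sh"
DISPLAY=:0 SSH_ASKPASS="$HOME/.ssh/askpass.sh" setsid ssh-add "$HOME/.ssh/id_ed25519" < /dev/null
# ... srun / julia --machine-file ... as before; ssh now authenticates via the agent ...
# shut the agent down when the job is done
ssh-agent -k

The passphrase file (~/.ssh/key_passphrase here) should of course be chmod 600 and live only in your home directory, not in the job script.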

Yes, that could be a way to go. I agree that not exposing it in the job script would be preferable, and would make the world a safer place.

Thanks. However, I also have the option of going back to using only a passwordless keypair. I will think a little about whether I really need a passphrase-protected keypair at the moment; it was something I created on a whim anyway.

Another update: it works again. The .ssh/authorized_keys file was regenerated, and that solved the issue.
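
For anyone who runs into the same thing: “regenerated” here just means the public key was put back into the file and the permissions fixed, roughly like this (assuming an ed25519 key; use your actual key file name):

cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys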

I had a similar problem in the past. If I remember correctly, when you call Julia like this you bypass SLURM (and its accounting of computing resources), and depending on the cluster configuration this might not be allowed or recommended. I ended up using ClusterManagers.jl, which integrates nicely with SLURM and uses srun to launch the processes on the worker nodes (https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/src/slurm.jl#L54).

In my SLURM script, I have a code block like this (where $script is the full path of the Julia script; see the note after the block for how it is set):

julia <<EOF
using Distributed
using ClusterManagers
addprocs(SlurmManager($SLURM_NTASKS))

hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end

@show hosts

include("$script")

for i in workers()
    rmprocs(i)
end
EOF

You can also skip the gethostname(), getpid() part, but it can be useful for troubleshooting.
No SSH logins are necessary for ClusterManagers.SlurmManager.
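
For completeness, this heredoc sits inside an ordinary sbatch script, which is also where $script gets its value. A minimal surrounding header might look like this (partition and resource values are only examples):

#!/bin/bash
#SBATCH --partition=some_partition
#SBATCH -n 64
#SBATCH --time=1-00:00:00
script=/full/path/to/your_script.jl  # the heredoc delimiter is unquoted, so $script and $SLURM_NTASKS expand
# ... followed by the julia <<EOF ... EOF block above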

That was a really good example of a better way to do it. I tried it just now and it worked perfectly. In addition to being the correct approach and giving more log output for debugging, I also think it creates much less memory overhead. I need to investigate this further, though.

Thank you very much!
