Slurm Manager not working

I am trying to use the ClusterManagers package for parallel computing, but I am getting an error message that I don’t understand. I am writing my code in a Jupyter notebook in VS Code, and this is the part of the initial code that I am trying to start from:

using ClusterManagers
using Distributed

OnCluster = true #set to false to run locally
addWorkers = true #set to false to run serially
println("OnCluster = $(OnCluster)")

# Current number of workers
currentWorkers = nworkers()
println("Initial number of workers = $(currentWorkers)")

# Increase the number of workers available
maxNumberWorkers = 10
if addWorkers
    if OnCluster
        addprocs(SlurmManager(maxNumberWorkers))
    else
        addprocs(maxNumberWorkers)
    end
end

However, I get an error message:

OnCluster = true
Initial number of workers = 1
Error launching Slurm job:
Output exceeds the size limit. Open the full output data in a text editor
TaskFailedException

    nested task error: IOError: could not spawn `srun -J julia-59209 -n 10 -o '/Users/my_name/Documents/my_project/Julia Code/Quantitative (simple)/./julia-59209-16770052037-%4t.out' -D '/Users/my_name/Documents/my_project/Julia Code/Quantitative (simple)' /System/Volumes/Data/Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia --worker=arf01ZXwDrm50sEj`: no such file or directory (ENOENT)
    Stacktrace:
     [1] _spawn_primitive(file::String, cmd::Cmd, stdio::Vector{Union{RawFD, IO}})
       @ Base ./process.jl:128
     [2] #725
       @ ./process.jl:139 [inlined]
     [3] setup_stdios(f::Base.var"#725#726"{Cmd}, stdios::Vector{Union{RawFD, IO}})
       @ Base ./process.jl:223
     [4] _spawn
       @ ./process.jl:138 [inlined]
     [5] #open#734
       @ ./process.jl:393 [inlined]
     [6] open (repeats 2 times)
       @ ./process.jl:383 [inlined]
     [7] launch(manager::SlurmManager, params::Dict{Symbol, Any}, instances_arr::Vector{WorkerConfig}, c::Condition)
       @ ClusterManagers ~/.julia/packages/ClusterManagers/S7Syg/src/slurm.jl:60
     [8] (::Distributed.var"#43#46"{SlurmManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:484

It seems there might be a problem with the path, but I have no clue what to do since I have very poor knowledge of computer science. Can anyone suggest a solution? Thanks in advance.

That’s a pretty confusing error message. When it says

no such file or directory (ENOENT)

it’s probably not referring to any of the arguments in that command; it’s referring to srun itself. That is, Julia can’t find srun in the PATH. This may be surprising, because srun is probably in your PATH when you run your interactive shell. My guess is that your interactive-shell PATH is set in ~/.bashrc or a similar file, and that file isn’t sourced by whatever process runs the command. Look into the differences between ~/.bashrc, ~/.bash_profile, ~/.profile, etc., on your cluster.
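To check this from inside the notebook, here is a minimal sketch using Sys.which from Base, which looks a program up on the PATH that the Julia process sees. The variable name srun_path is just for illustration. If it reports that srun is missing, either the directory is not on that PATH or srun is not installed on the machine running the notebook.

```julia
# Ask Julia (not the interactive shell) whether it can locate `srun`.
# Sys.which returns the full path as a String, or `nothing` if not found.
srun_path = Sys.which("srun")

if srun_path === nothing
    println("srun not found on the PATH seen by Julia")
    println("PATH entries seen by Julia:")
    # PATH entries are separated by ':' on Linux/macOS and ';' on Windows
    sep = Sys.iswindows() ? ';' : ':'
    foreach(println, split(get(ENV, "PATH", ""), sep))
else
    println("srun found at: ", srun_path)
end
```

Comparing the printed entries with the output of echo $PATH in your terminal should show which directory is missing.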

It looks like there’s no direct way to alter that command. So you’ll have to figure out how to adjust your PATH correctly. Just to be sure, run

type -a srun

in your terminal. This will give you the full path to srun. Take the directory part of that full path, and add it to your PATH somewhere. It may be as easy as adding something like

ENV["PATH"] = "/path/to/parent/directory:"*ENV["PATH"]

(where you replace /path/to/parent/directory with the directory that the type command reports, but keep the colon : just before the closing quote) to your Julia script, before the call to addprocs.
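A slightly more defensive version of the same idea, as a sketch: prepend_to_path! is a hypothetical helper (it is not part of ClusterManagers or Distributed), and /path/to/parent/directory is still a placeholder for your own directory. It avoids adding the directory twice if the cell is re-run, and then checks that srun has become visible.

```julia
# Hypothetical helper: put `dir` at the front of PATH for this Julia session,
# unless it is already one of the PATH entries.
function prepend_to_path!(dir::AbstractString)
    sep = Sys.iswindows() ? ';' : ':'
    old = get(ENV, "PATH", "")
    if dir ∉ split(old, sep)
        ENV["PATH"] = isempty(old) ? dir : string(dir, sep, old)
    end
    return ENV["PATH"]
end

# Placeholder: replace with the directory part of what `type -a srun` prints.
srun_dir = "/path/to/parent/directory"
prepend_to_path!(srun_dir)

# Confirm the change had the intended effect before calling addprocs.
if Sys.which("srun") === nothing
    @warn "srun still not found; check the directory passed to prepend_to_path!"
end
```

The change only affects the current Julia session, so it has to run before addprocs(SlurmManager(...)) each time the notebook is restarted.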


Thank you very much. I will try that. It’s kind of a shame that I only know how to open VS Code, write code, and run Julia, but I’m unable to really know what’s going on behind the scenes.

Well, that’s a good start! Most of us just pick things up as we go along.

Good luck.
