SLURM: Julia scripts do not exit or produce any output

Dear Julia community,

After successfully experimenting with Julia on my local machine, I am trying to get it to work on a cluster running SLURM.
After submitting a batch script (see below for an example), the job gets started and just keeps sitting on the nodelist without actually doing anything. I do not get any output or error message until the job time runs out.
While trying to identify the problem, I have reduced my script to the following minimal (non-)working example.
Batch script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=15
#SBATCH --qos=hiprio
#SBATCH --time=00:02:00          # total run time limit (HH:MM:SS)
#SBATCH --output=/home/<user>/Julia/jultest.out  #where <user> is my username
#SBATCH --error=/home/<user>/Julia/jultest.err

module purge
module load Julia
cd /home/<user>/Julia
srun julia test.jl

# Filename: test.jl
println("test")

I am now completely at a loss as to how to continue from here.
A while ago I successfully tested a more involved script (which I found online) on two nodes, but now that no longer works either. I am starting to think that it might not be a Julia problem but rather a cluster-specific issue. If I run the script directly from the shell, it works fine. Does anyone have a good minimal working example, or see anything wrong with the way I am submitting my job?
(My actual goal is to run a multithreaded program with Threads.@threads on a single node.)

Thank you for reading until here, I hope you can help me!

Hi!
I use Julia on a SLURM cluster with Threads.@threads on a single node. This is my SBATCH script:

#!/bin/bash
#SBATCH -J AnovaCluster
#SBATCH -N 1 
#SBATCH -n 1 
#SBATCH --threads-per-core=1
#SBATCH --cpus-per-task=36
#SBATCH --time=03:10:00
#SBATCH --mail-user=xxx@yyy
#SBATCH --mail-type=ALL

module load julia/curr

workdir=/tmpdir/<user>/anova_cluster
cd ${workdir}

export JULIA_NUM_THREADS=36

julia main.jl

And to start the calculation: sbatch sbatch_file.sh
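For reference, a minimal main.jl along these lines might look like the following. This is just a sketch; the array size and the per-item work are placeholders:

# main.jl -- minimal multithreaded sketch (placeholder workload)
using Base.Threads

println("running with ", nthreads(), " threads")

n = 1_000
results = zeros(n)

@threads for i in 1:n
    results[i] = sum(sin, 1:i)   # stand-in for the real per-item computation
end

println("total = ", sum(results))

With the JULIA_NUM_THREADS=36 export above, nthreads() should report 36.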

This is likely because of buffering. You can try to explicitly flush the output or write to stderr (which I think is unbuffered).
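For example (just a sketch):

println("progress message")
flush(stdout)                         # force buffered output to be written immediately

println(stderr, "progress message")   # stderr is typically unbuffered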

Thank you, I found my issue!
It seems like the

#SBATCH --cpus-per-task= ...

line was essential (I feel slightly embarrassed, although I would have expected an error message in that case). Adding this line to my batch script made the job finish and produce output.
As a side note, threads also seem to work as intended.

Thank you a lot for your help!

Thank you for the idea; the solution turned out to be more trivial than that. (Actually, for this very short script the output is probably flushed at the end anyway, so it does not matter here; it might help in some other cases though :smile:).

I’m having a problem similar to the OP’s, I think. I’m on a SLURM cluster with several nodes. When I specify the output argument so that all output is collected in a single file, I am unable to create processes with addprocs_slurm. The following code gets stuck trying to connect:

julia> using ClusterManagers

julia> addprocs_slurm(32, p = "esbirro", x  = "es[1]", ntasks_per_node = 16, exeflags = "--project", output = "job.out")
connecting to worker 1 out of 32  ## It gets stuck here

If I remove the output parameter it works, generating 32 output files, but I would like to avoid this.
@kristoffer.carlsson, how can I “flush the output”?

I am not sure whether each process can write to the same file at the same time, but having the separate files is a bit tedious. In my scripts I usually add an rm julia-*.out to remove them. Alternatively, you could add a command that concatenates them all into a single file afterwards (see the sketch below). The errors are usually put in the main output file, at least in my experience.
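For instance, the concatenation could even be done from Julia itself. A rough sketch, assuming the julia-*.out naming from above; the combined.out name is just a placeholder:

# merge the per-task worker logs into one file, then delete them
outfiles = sort(filter(f -> startswith(f, "julia-") && endswith(f, ".out"), readdir()))
open("combined.out", "w") do io
    for f in outfiles
        println(io, "===== ", f, " =====")
        write(io, read(f))
        rm(f)
    end
end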

Ah, very good point. However, I suspect there is something else going on, because if I do output = "/dev/null" I get the same behaviour.

It might still have each process trying to lock the file, but I think your solution should work. There is also another reason it might not work: I think the output files are important in this case, since I have always noticed the IP and hostname of the node at the top of each output file. I have a feeling these are used to communicate back to the main node to set up and connect the worker processes. Maybe someone else knows exactly what’s going on, but that would be my guess.

Poking around the source code of ClusterManagers.jl: it seems that the output file name is not constructed from the output kwarg, but rather from the job_output_loc kwarg and a hardcoded job_output_name, with a unique identifier per task. Check it out: ClusterManagers.jl/slurm.jl at 14e7302f068794099344d5d93f71979aaf4fbeb3 · JuliaParallel/ClusterManagers.jl · GitHub
This at least allows us to specify the path to keep things a bit tidier, but indeed, we currently cannot have a single output file for all jobs.

I’ve resorted to stashing the files in a folder of their own and then deleting them with an epilog script, similar to the way you do.
