Running A Julia Script Through SLURM

I have a Julia script that runs a physics simulation. The simulation takes around 8 hours, and I want to run it with many random initial conditions. My plan is to run the script multiple times using a SLURM job array, with ArgParse.jl processing the command-line arguments that determine the simulation parameters, including the seed for the randomization (which is what the job array ID controls). My script does not use multithreading or distributed computing in any way; I just want to run it many times with different initial conditions.

Unfortunately, I am somehow getting a segmentation fault from ArgParse.jl.

Here is an MWE.

argparse.jl:

using ArgParse

s = ArgParseSettings()
#add_arg_table(s, "arg1", Dict(:nargs=>1, :required=>true))
@add_arg_table s begin
    "arg1"
        help = "First argument"
        required = true
end
p = parse_args(s)
println("Argument 1 is: ", p["arg1"])

argparse.sb:

#!/bin/bash --login
#SBATCH --job-name=argparse  # Job name
#SBATCH --mail-type=NONE             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --nodes=1                    # Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=1          # Maximum number of tasks on each node
#SBATCH --cpus-per-task=1            # Number of processors for each task (want several because the BLAS is multithreaded, even though my Julia code is not)
#SBATCH --mem=2G                     # Memory (i.e. RAM) per NODE
#SBATCH --export=ALL                
#SBATCH --constraint=intel18         
#SBATCH --time=0-00:05:00            # Wall time limit (days-hrs:min:sec)
#SBATCH --output=argparse_%A.log     # Path to the standard output and error files relative to the working directory


echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Per Node       = $SLURM_NTASKS_PER_NODE"
echo "Number of CPUs Per Task       = $SLURM_CPUS_PER_TASK"
echo ""

which julia
julia ./argparse.jl 1

The log generated by running sbatch argparse.sb:

Date              = Fri Nov 29 05:59:40 PM EST 2024
Hostname          = skl-027
Number of Nodes Allocated      = 1
Number of Tasks Per Node       = 1
Number of CPUs Per Task       = 1

/mnt/home/leespen1/.juliaup/bin/julia
/var/lib/slurmd/job47051755/slurm_script: line 22: 1426934 Segmentation fault      (core dumped) julia ./argparse.jl 1

I have no idea where to go from here. Any advice on how to fix this (or an alternative workflow that would accomplish the same thing) would be appreciated. All the material I have found on using Julia in HPC has been about multithreading or distributed computing, which is not what I am trying to do.

PS, the segfault does not happen when I run argparse.sb as a bash script using salloc:

eespen1@dev-intel18:~/Research/QuantumGateDesign.jl/cnot3$ salloc --nodes=1 --ntasks=1 --mem=2G --cpus-per-task=1 --constraint=intel18 --time=00:05:00
salloc: Granted job allocation 47051784
salloc: Waiting for resource configuration
salloc: Nodes skl-031 are ready for job
leespen1@skl-031:~/Research/QuantumGateDesign.jl/cnot3$ bash argparse.sb 
Date              = Fri Nov 29 06:11:15 PM EST 2024
Hostname          = skl-031
Number of Nodes Allocated      = 1
Number of Tasks Per Node       = 
Number of CPUs Per Task       = 1

/mnt/home/leespen1/.juliaup/bin/julia
Argument 1 is: 1

Sounds strange…

I’m using a very similar setup, so the general idea should definitely work (create a Julia script with ArgParse and call it from the Slurm script).

  • In my experience, when salloc and sbatch behave differently, the cause is often the shell environment. I’m not sure that applies here, since you start bash as a login shell and export the user environment explicitly. But perhaps you can check (e.g. by running env in both cases to print the whole environment and seeing whether anything differs).
  • EDIT: Related to the first point: Is there anything in your .bashrc that could make the two cases behave differently?
  • Another useful step in debugging this problem would be to compare the output of versioninfo() and Pkg.status() in the Julia scripts. It looks like the same executable is called, but maybe the problem is in a specific version of Julia and/or the packages.
  • Are you using the global Julia environment? Have you tried creating a new environment in the script and adding your dependencies (ArgParse here) before you call the script? If the problem is related to (pre)compilation, this might help sort it out.
  • Related to the compilation question: Are the machine CPUs identical? You specify intel18 in both cases, but I don’t know whether the CPUs could still differ in some details.
  • Does the issue appear always on the same machine or on all machines (you could request a certain node explicitly to test that)?

Might be something completely unrelated, but maybe checking these things can help narrow down the problem.
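The environment comparison suggested in the first bullet could be sketched roughly as follows. The helper names snapshot_env and compare_env and the env_<label>.txt file naming are made up for illustration; only env, sort, and diff are standard tools.

```shell
# Write a sorted snapshot of the current environment to env_<label>.txt.
# The function name and file naming are illustrative, not from the thread.
snapshot_env() {
    local label="${1:-unknown}"
    env | sort > "env_${label}.txt"
}

# Show the lines that differ between two snapshots.
# diff exits with status 1 when the files differ, which is expected here.
compare_env() {
    diff "env_$1.txt" "env_$2.txt"
}
```

One possible use: paste the functions into argparse.sb and call snapshot_env sbatch just before the julia line, submit the job with sbatch, then call snapshot_env salloc from the same script inside an salloc session, and finally run compare_env sbatch salloc from the shared working directory. Some differences (job IDs, hostnames) are expected; anything touching PATH, LD_LIBRARY_PATH, or Julia-related variables is probably more interesting.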

How much access do you have on your cluster? Could you ssh into the allocated node skl-027 manually and run the script directly to see if it produces a segfault? There could be a hardware error on that node.

Here is an alternative workflow that might work. I don’t use ArgParse for this; I just do this in Julia:

id = Base.parse(Int, ENV["SLURM_ARRAY_TASK_ID"])

This reads the array task ID that Slurm assigns to each task in a job array. So if you have --array=1-30 in your Slurm submit script, each Julia job gets a unique ID number (1 to 30) that it can use to set up parameters, seeds, etc.
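A submit script for this approach might look roughly like the sketch below. It is an assumption-laden example rather than a tested cluster configuration: sim.jl is a placeholder name for the simulation script, the array range and resource limits are arbitrary, and array_task_id is a hypothetical helper that falls back to 1 so the script can also be run by hand outside a job array.

```shell
#!/bin/bash --login
#SBATCH --job-name=argparse_array
#SBATCH --array=1-30                 # Example range: one array task per seed
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=0-00:05:00
#SBATCH --output=argparse_%A_%a.log  # %A = array job ID, %a = array task index

# Print the array task index, or 1 when the script is run outside a job
# array (for example by hand during testing). The helper name is illustrative.
array_task_id() {
    echo "${SLURM_ARRAY_TASK_ID:-1}"
}

# Export the (possibly defaulted) value so the Julia process always sees it.
SLURM_ARRAY_TASK_ID="$(array_task_id)"
export SLURM_ARRAY_TASK_ID
echo "Array task ID = ${SLURM_ARRAY_TASK_ID}"

# sim.jl is a placeholder for a script that reads ENV["SLURM_ARRAY_TASK_ID"]
# itself, so no command-line arguments are passed here.
if [ -f ./sim.jl ]; then
    julia ./sim.jl
else
    echo "sim.jl not found; skipping the Julia run" >&2
fi
```

With this layout each array task writes its own log file, and the Julia side only needs the parse(Int, ENV["SLURM_ARRAY_TASK_ID"]) line shown above to pick its seed.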

I was on a deadline so I ended up just removing the ArgParse dependency and writing a bash script to replace the relevant variables, but when I have time I will try your suggestions. Thank you very much!