Poor CPU Utilisation on HPC Cluster

Running into an odd error and not sure where to begin, wondering if you folks had any advice.

I’m using Distributed.jl to parallelise iterations of a CPU-intensive function, with my code looking like this:

# code in myfile.jl

@everywhere begin
  using MyPackage

  function my_func(params)
    # do lots of computations here
    return my_vals
  end
end

@everywhere myparams = # 81-long vector of parameter objects

result = pmap(params -> my_func(params), myparams) # Returns a vector of my_vals values

# write out result as a text file

I’m running this on an HPC cluster using Slurm as a scheduler, with the following batch script:

# set the number of nodes.
#SBATCH --nodes=4
# set the number of CPUs required.
#SBATCH --ntasks-per-node=41
# set the amount of memory needed for each CPU.
#SBATCH --mem-per-cpu=8000
# set max wallclock time (hh:mm:ss).
#SBATCH --time=96:00:00
# set the time partition for the job. 
#SBATCH --partition=long
# set name of job (AND DATE!)
#SBATCH --job-name=MyLongJob
# mail alert at start, end, and abortion of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=my_email@my_inst.ac.uk
# run the application

module load Julia
julia -p 81 myfile.jl

When I run this, the job times out, and when I run seff my_pid on the command line to check the job's memory and CPU usage, I get:

Job ID: my_pid
Cluster: mycluster
User/Group: myuser/internal
State: TIMEOUT (exit code 0)
Nodes: 4
Cores per node: 41
CPU Utilized: 00:00:10
CPU Efficiency: 0.00% of 656-01:43:52 core-walltime
Job Wall-clock time: 4-00:00:38
Memory Utilized: 220.20 GB
Memory Efficiency: 17.19% of 1.25 TB

What’s weird to me is that the CPU Utilized is so low - is this really saying I only used 10s of CPU time total?
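For reference, the derived figures in that report can be reproduced from the raw numbers, which suggests the 10 s is being compared against the core-walltime of all 164 allocated cores rather than a single one. This is only a quick arithmetic sketch; the variable names are mine, and the memory line assumes seff reads 8000 as MB and uses 1 GB = 1024 MB.

```julia
# Reproduce the derived figures in the seff report from its raw numbers.
nodes, cores_per_node = 4, 41
cores = nodes * cores_per_node                 # 164 allocated cores

wall_s = 4 * 86_400 + 38                       # "4-00:00:38" in seconds
core_walltime_s = cores * wall_s               # denominator of "CPU Efficiency"

# Convert back to Slurm's D-HH:MM:SS layout.
days, rest = divrem(core_walltime_s, 86_400)
hours, rest = divrem(rest, 3_600)
minutes, seconds = divrem(rest, 60)

cpu_used_s = 10                                # "CPU Utilized: 00:00:10"
cpu_efficiency = 100 * cpu_used_s / core_walltime_s   # in percent

# Assumption: --mem-per-cpu=8000 is in MB and 1 GB = 1024 MB.
mem_requested_gb = cores * 8000 / 1024
mem_efficiency = 100 * 220.20 / mem_requested_gb      # in percent

println("core-walltime: ", days, "-", lpad(hours, 2, '0'), ":",
        lpad(minutes, 2, '0'), ":", lpad(seconds, 2, '0'))
println("CPU efficiency (%): ", cpu_efficiency)
println("Memory efficiency (%): ", round(mem_efficiency; digits=2))
```

The core-walltime comes out as 656-01:43:52, matching the report, so the CPU Utilized figure is a job-wide total and not a per-core one.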

If that’s the case, I can only think of two scenarios:

  1. seff is only measuring the CPU time on one core. I do over-request cores (2x the necessary number) as a quick and nasty way of avoiding out-of-memory errors, so perhaps this is just measuring the usage of an unutilised core?

  2. Something - perhaps loading packages and other files using @everywhere - is taking a really long time, and timing out the job before the calculations begin.
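One way to tell these two scenarios apart might be to time the loading step on its own and to check which hosts the workers are running on. The following is a minimal sketch, not a tested recipe for this cluster: Statistics stands in for MyPackage, and the addprocs(2) fallback is only there so the snippet also runs outside julia -p N.

```julia
using Distributed

# Assumption: if no workers exist yet (script not started with -p), add two
# local ones so the sketch still runs.
nprocs() == 1 && addprocs(2)

# Time the load step separately from the computation.
# `Statistics` is a stand-in for the real package.
t0 = time()
@everywhere using Statistics
t_load = time() - t0
println("loading on ", nworkers(), " workers took ", round(t_load; digits=2), " s")

# Ask every worker for its hostname and count workers per host.
hosts = [remotecall_fetch(gethostname, w) for w in workers()]
counts = Dict{String,Int}()
for h in hosts
    counts[h] = get(counts, h, 0) + 1
end
println("workers per host: ", counts)
```

If every worker reports the same hostname, the job is only using one of the allocated nodes; if the load time is already a large fraction of the wall-clock limit, scenario 2 is the more likely culprit.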

I was wondering if anyone has any suggestions as to what’s going wrong here, or any workarounds if you’ve encountered this issue before?

Maybe you’re running into JuliaLang/julia issue #39291 (“@everywhere is slow on HPC with multi-node environment”) or issue #44645 (“Unexplained slowness in `@everywhere` remotecalls with imports”)?

Yeah I wondered this. I’ll try the workaround, and see what effect that has.

If you can, I’d also try Julia v1.8.0-beta3 or nightly, which should have the fix for those bugs included, so hopefully no special workaround would be needed with them.

@jewh just FYI, this precise issue is why I ended up writing the fix in https://github.com/JuliaLang/julia/pull/44671. Precompilation helps a bit, but you’ll still spend a terrible amount of time just initializing stuff in an almost-serial way (see the rough benchmark in the PR).

The fix (currently in 1.8.0-beta3, as Mose pointed out) won’t let you dodge the package loading and precompilation time, but at least everything will run fully in parallel, and you can easily improve things further with “normal” use of PackageCompiler.jl.

Also, why not use ClusterManagers.jl and addprocs_slurm? IIRC julia -p81 ... will spawn all processes on a single node. (What I usually do is documented e.g. here with the sbatch script here)
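A rough sketch of that approach, assuming ClusterManagers.jl is installed in the active project and the script is started inside the sbatch job as plain julia myfile.jl (no -p flag). The worker count of 81 is taken from the original command, MyPackage is the placeholder name from the question, and how Slurm distributes the 81 tasks over the 4 nodes depends on the cluster's settings.

```julia
# myfile.jl, launched inside the sbatch job as `julia myfile.jl` (no -p flag)
using Distributed
using ClusterManagers  # third-party package, assumed installed

# Workers are started through srun inside the existing allocation, so they can
# land on any of the allocated nodes and not only on the node running this script.
addprocs_slurm(81)

@everywhere using MyPackage  # placeholder package name from the original post

# Sanity check: list the hosts the workers actually landed on.
println(unique(remotecall_fetch(gethostname, w) for w in workers()))
```

With this layout, the rest of the script (the @everywhere definitions and the pmap call) should be able to stay as it is.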