I'm running into an odd error and am not sure where to begin, so I was wondering if you folks had any advice.
I’m using Distributed.jl to parallelise iterations of a CPU-intensive function, with my code looking like this:
# code in myfile.jl
@everywhere begin
using MyPackage
include("UsefulFunctions.jl")
end
@everywhere function my_func(params)
# do lots of computations here
return my_vals
end
@everywhere myparams = # 81-long vector of parameter objects
result = pmap(params -> my_func(params), myparams) # Returns a vector of my_vals values
# write out result as a text file
I’m running this on an HPC cluster using Slurm as a scheduler, with the following batch script:
#!/bin/bash
# set the number of nodes.
#SBATCH --nodes=4
# set the number of CPUs required.
#SBATCH --ntasks-per-node=41
# set the amount of memory needed for each CPU.
#SBATCH --mem-per-cpu=8000
# set max wallclock time (hh:mm:ss).
#SBATCH --time=96:00:00
# set the time partition for the job.
#SBATCH --partition=long
# set name of job (AND DATE!)
#SBATCH --job-name=MyLongJob
# mail alert at start, end, and abortion of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=my_email@my_inst.ac.uk
# run the application
module load Julia
julia -p 81 myfile.jl
When I run this, the job times out, and if I use `seff my_pid` on the command line to check the job's memory and CPU usage, I get:
Job ID: my_pid
Cluster: mycluster
User/Group: myuser/internal
State: TIMEOUT (exit code 0)
Nodes: 4
Cores per node: 41
CPU Utilized: 00:00:10
CPU Efficiency: 0.00% of 656-01:43:52 core-walltime
Job Wall-clock time: 4-00:00:38
Memory Utilized: 220.20 GB
Memory Efficiency: 17.19% of 1.25 TB
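As a sanity check on these figures: the core-walltime and memory totals are consistent with seff counting all 4 × 41 = 164 allocated cores for the full wall-clock time. Below is a rough sketch of that arithmetic in Julia. The variable names are mine, purely for illustration, and the memory conversion assumes seff treats 1 GB as 1024 MB.

```julia
# Values copied from the batch script and the seff output above.
cores        = 4 * 41              # --nodes × --ntasks-per-node
wall_seconds = 4 * 86_400 + 38     # Job Wall-clock time: 4-00:00:38
cpu_seconds  = 10                  # CPU Utilized: 00:00:10

# Core-walltime is the wall-clock time multiplied by every allocated core.
core_seconds = cores * wall_seconds

# Format as D-HH:MM:SS to compare with the seff line.
days, rest       = divrem(core_seconds, 86_400)
hours, rest      = divrem(rest, 3_600)
minutes, seconds = divrem(rest, 60)
formatted = string(days, "-", lpad(hours, 2, '0'), ":",
                   lpad(minutes, 2, '0'), ":", lpad(seconds, 2, '0'))
println("core-walltime:  ", formatted)

# CPU efficiency: CPU time actually used over core-walltime.
cpu_eff = 100 * cpu_seconds / core_seconds
println("CPU efficiency: ", round(cpu_eff; digits=2), " %")

# Memory: --mem-per-cpu=8000 (MB) for each allocated core,
# assuming 1 GB = 1024 MB in the seff report.
total_mem_gb = cores * 8000 / 1024
mem_eff = 100 * 220.20 / total_mem_gb
println("memory efficiency: ", round(mem_eff; digits=2), " % of ",
        round(total_mem_gb / 1024; digits=2), " TB")
```

So the 0.00% is measured against all 164 cores, not a single one: 10 s of CPU time is vanishingly small next to roughly 656 core-days.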
What’s weird to me is that the CPU Utilized is so low. Is this really saying I only used 10 s of CPU time in total?
If that’s the case, I can only think of two scenarios:
1. `seff` is only measuring the CPU time on one core. Indeed I over-request the number of cores (2x the necessary number), as a quick and nasty way of escaping out-of-memory errors. Perhaps this is just measuring the usage of an unutilised core?
2. Something, perhaps loading packages and other files using `@everywhere`, is taking a really long time, and timing out the job before the calculations begin.
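One way to test the second scenario would be to time the `@everywhere` loading step and to print which host each worker ended up on. As far as I understand, `julia -p N` starts all N workers on the local machine, so the host check should also show whether the other nodes are being used at all. This is a minimal sketch, not my actual script: `Statistics` stands in for `MyPackage`, and the `addprocs(2)` line is only there so the snippet can be run on its own without `-p`.

```julia
using Distributed

# Only for running this snippet standalone; with `julia -p 81`
# the workers already exist and this line does nothing.
nprocs() == 1 && addprocs(2)

# Time the loading step on every process.
# `Statistics` is a stand-in for MyPackage / UsefulFunctions.jl.
t0 = time()
@everywhere begin
    using Statistics
end
t_load = time() - t0
println("Loading on ", nworkers(), " workers took ",
        round(t_load; digits=2), " s")
flush(stdout)  # make the line show up in the Slurm log straight away

# Ask every worker which machine it is running on.
hosts = [remotecall_fetch(gethostname, w) for w in workers()]
for h in unique(hosts)
    println(h, ": ", count(==(h), hosts), " workers")
end
flush(stdout)
```

If the loading time never gets printed, the job is stuck before any computation starts. If every worker reports the same hostname, all of the work is landing on a single node.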
I was wondering if anyone has any suggestions as to what’s going wrong here, or any workarounds if you’ve encountered this issue before?