I'm running into an odd error and am not sure where to begin, so I was wondering if you folks had any advice.
I’m using Distributed.jl to parallelise iterations of a CPU-intensive function, with my code looking like this:
```julia
# code in myfile.jl
using Distributed  # loaded implicitly by `julia -p`, but harmless to state explicitly

@everywhere begin
    using MyPackage
    include("UsefulFunctions.jl")
end

@everywhere function my_func(params)
    # do lots of computations here
    return my_vals
end

@everywhere myparams = ...  # 81-long vector of parameter objects

result = pmap(params -> my_func(params), myparams)  # returns a vector of my_vals values

# write out result as a text file
```
I’m running this on an HPC cluster using Slurm as a scheduler, with the following batch script:
```bash
#!/bin/bash

# set the number of nodes.
#SBATCH --nodes=4

# set the number of CPUs required.
#SBATCH --ntasks-per-node=41

# set the amount of memory needed for each CPU.
#SBATCH --mem-per-cpu=8000

# set max wallclock time (hh:mm:ss).
#SBATCH --time=96:00:00

# set the time partition for the job.
#SBATCH --partition=long

# set name of job (AND DATE!)
#SBATCH --job-name=MyLongJob

# mail alert at start, end, and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=my_email@my_inst.ac.uk

# run the application
module load Julia
julia -p 81 myfile.jl
```
When I run this, the code times out, and if I use `seff my_pid` on the command line to check my memory and CPU usage for the job, I get:
```
Job ID: my_pid
Cluster: mycluster
User/Group: myuser/internal
State: TIMEOUT (exit code 0)
Nodes: 4
Cores per node: 41
CPU Utilized: 00:00:10
CPU Efficiency: 0.00% of 656-01:43:52 core-walltime
Job Wall-clock time: 4-00:00:38
Memory Utilized: 220.20 GB
Memory Efficiency: 17.19% of 1.25 TB
```
What’s weird to me is that the CPU Utilized is so low - is this really saying I only used 10s of CPU time total?
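As a sanity check on how to read those numbers, the core-walltime denominator can be reproduced by hand from the figures in the `seff` output. The sketch below is only arithmetic on the reported values (the variable names are mine). It suggests the efficiency figure is computed against all 4 x 41 allocated cores rather than a single one, although it says nothing about how the 10 s of CPU time itself was collected.

```shell
#!/bin/bash
# Values copied from the seff output above.
nodes=4
cores_per_node=41

# Job wall-clock time 4-00:00:38 (D-HH:MM:SS) converted to seconds.
wall_seconds=$(( 4*86400 + 0*3600 + 0*60 + 38 ))

total_cores=$(( nodes * cores_per_node ))
core_seconds=$(( total_cores * wall_seconds ))

# Convert back to the D-HH:MM:SS layout that seff prints.
days=$(( core_seconds / 86400 ))
rem=$(( core_seconds % 86400 ))
core_walltime=$(printf '%d-%02d:%02d:%02d' \
    "$days" $(( rem / 3600 )) $(( rem % 3600 / 60 )) $(( rem % 60 )))

echo "total cores:   $total_cores"
echo "core-walltime: $core_walltime"
```

This prints 164 cores and a core-walltime of 656-01:43:52, matching the denominator that `seff` reports.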
If that’s the case, I can only think of two scenarios:
1. `seff` is only measuring the CPU time on one core. Indeed, I over-request the number of cores (2x the necessary number) as a quick and nasty way of escaping out-of-memory errors. Perhaps this is just measuring the usage of an unutilised core?
2. Something, perhaps loading packages and other files using `@everywhere`, is taking a really long time, so the job times out before the calculations begin.
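Neither scenario can be confirmed from `seff` alone. One thing that may help is logging what Slurm actually allocated and where the batch script runs, since (as far as the Julia manual describes it) `-p N` starts its worker processes on the machine where `julia` is launched. The following is a minimal sketch of diagnostic lines that could go just before the `julia` call. The function name `report_allocation` is hypothetical. The `SLURM_*` variables are ones Slurm sets inside a job, and `unset` is printed when the script runs outside one.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original batch script):
# print where the batch script runs and what Slurm allocated to the job.
report_allocation() {
    echo "Batch host:      $(hostname)"
    echo "Nodes allocated: ${SLURM_JOB_NUM_NODES:-unset}"
    echo "Tasks allocated: ${SLURM_NTASKS:-unset}"
    echo "Node list:       ${SLURM_JOB_NODELIST:-unset}"
    echo "Started at:      $(date)"
}

report_allocation
```

Comparing the "Started at" line with timestamps written from `myfile.jl` (for example, one after the `@everywhere` block and one before `pmap`) would show whether the time is being spent in package loading or in the computation itself.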
I was wondering if anyone has any suggestions as to what’s going wrong here, or any workarounds if you’ve encountered this issue before?