Unexpected OutOfMemory error on HPC

Hello,
I am using an HPC cluster to run a code that solves two huge sparse linear systems. I was able to test the code on my Windows machine with 16 GB of memory. I used the `@time` macro to monitor the memory allocations at the critical places (matrix assembly, matrix multiplication, system solving, etc.):

0.573989 seconds (143.03 k allocations: 710.180 MiB, 22.39% gc time)
0.568460 seconds (71.21 k allocations: 1.865 GiB, 0.75% gc time)
9.861626 seconds (4.70 M allocations: 1.018 GiB)
0.177318 seconds (275.76 k allocations: 120.454 MiB, 19.67% gc time)
0.220135 seconds (82.13 k allocations: 39.772 MiB)
0.005179 seconds (123 allocations: 8.219 KiB)

I was also able to reduce the number of allocations by using dropzeros!(...) and other features of sparse matrices. In general, the code consumed up to 7 GB of memory on Windows. However, when I moved to the cluster and allocated 10 GB for the same code, I got an OutOfMemory error. I increased it to 20 GB and it also crashed! It only works when I use 25 GB. I plotted the memory consumption in kB, as measured by a tool on our cluster, and got this graph.
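For reference, dropzeros! removes explicitly stored zeros in place, which shrinks nnz (and hence memory) without changing the matrix's values. A minimal sketch on a toy matrix (not the real data):

```julia
using SparseArrays

# A 3×3 sparse matrix where one stored entry is an explicit zero
A = sparse([1, 2, 3], [1, 2, 3], [1.0, 0.0, 2.0])
nnz(A)          # 3 stored entries, one of which is an explicit zero

dropzeros!(A)   # drop stored zeros in place
nnz(A)          # 2 stored entries; the matrix values are unchanged
```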

Any suggestion?

Can you post the code that’s doing the bulk of the work?

This may not be your issue, but note that if you’re running with addprocs() on the same node, you will be copying all data to each process. This means that, if you have an 8-core node, and you run addprocs(7) on it, you will have 8 separate copies of your data as each process will get its own.
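To illustrate the point above, a minimal sketch (toy data, not the poster's code) of how each worker process ends up holding its own full copy:

```julia
using Distributed

addprocs(2)                    # each worker is a separate OS process

# @everywhere runs this on the master and on every worker, so each process
# allocates its own ~8 MB array; total memory scales with the process count.
@everywhere data = rand(10^6)
```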

Without knowing more about how you set up your cluster jobs and your multiprocessing environment, it’s hard to say what the problem is.

Thanks for your prompt answer. I even tried using 1 core only on a single node, and allocated 20 GB to this core only … didn’t work!

If you have code and more details of your setup, that would help.

@time(NNN = α^(-2) * C * C_t)
@time(A__ = B * B_t + NNN)
@time(W = A__ \ b_)
@time(V = -B_t * W)
@time(η = C \ -(b_ + B * V))
@time(y = reshape(V, N, M))

where:
C : sparse matrix of size (659934, 9999); C_t : its transpose
B : sparse matrix of size (659934, 660000); B_t : its transpose
α : scalar
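As a rough sanity check (my own arithmetic, not from the thread), the dimensions above rule out any dense fallback: even C alone held densely would need far more than the 25 GB observed, so every intermediate product must stay sparse.

```julia
# Dense Float64 storage needed for the matrices quoted above
dense_C   = 659_934 * 9_999 * 8      # C as a dense matrix: ≈ 49 GiB
dense_CCt = 659_934^2 * 8            # C * C_t as a dense matrix: ≈ 3.2 TiB

println(dense_C / 2^30, " GiB, ", dense_CCt / 2^40, " TiB")
```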

… and how have you set up your distributed environment?

Do you happen to know which metric of “mem consumption” the cluster is reporting? Is it resident set size, virtual memory reservation, or …?

Good questions from @traktofon.
Also ask your systems managers if the job is being run within a cgroup or container.
It looks like it is, since you are allocated a fixed amount of memory.

Yes, that’s indeed right. The way I allocate my resources is as follows:

#PBS -l walltime=....
#PBS -l pmem=....    # memory per core; it is also possible to use mem=..., which allocates memory per node

I don’t really know exactly, as I don’t have access to this data in real time. All I have is the total vmem and mem usage reported in the stdout file. One example from a job that crashed is the following:

Resource List: nodes=1:ppn=1,pmem=15gb,walltime=00:05:00,neednodes=1:ppn=1
Resources Used: cput=00:02:28,vmem=10906160kb,walltime=00:02:39,mem=7072720kb,energy_used=0

Really, really sorry to ask this.
Can you cut and paste the OutOfMemory error? I think it is the PBS mechanism that is terminating the job: the job here is not being run within a cgroup, and it is the PBS daemon that monitors memory use and kills the job.

You can get a detailed log output from the PBS job log on the first compute node, but you probably have to be a root user to do this.

But I must say this does not really help with your underlying Julia issue of higher memory use on a Linux cluster than on Windows.

Can you share your PBS job submission script?
Also on the compute nodes how many cores are there and how much memory?
Do you know if hyperthreading is enabled? As far as I know, PBS is not aware of hyperthreading.
I could be wrong!

I forgot to clarify the distinction between two scenarios. In the first, when I (think I) allocated sufficient memory based on what I tested on Windows, the job shows the OutOfMemory error as:

[ Info:  started timer at: 2020-04-01T15:14:48.981
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Array at ./boot.jl:404 [inlined]
 [2] spmatmul(::SparseMatrixCSC{Float64,Int64}, ::SparseMatrixCSC{Float64,Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/SparseArrays/src/linalg.jl:212
 [3] *(::SparseMatrixCSC{Float64,Int64}, ::SparseMatrixCSC{Float64,Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/SparseArrays/src/linalg.jl:187
 [4] LSS() at /ddn1/vol1/site_scratch/../../../..//1500/LSS.jl:130
 [5] top-level scope at /ddn1/vol1/site_scratch//../../../../1500/Main.jl:80
 [6] include at ./boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1105
 [8] include(::Module, ::String) at ./Base.jl:31
 [9] exec_options(::Base.JLOptions) at ./client.jl:287
 [10] _start() at ./client.jl:460
in expression starting at /ddn1/vol1/site_scratch/../../../../1500/Main.jl:80

I scaled down my problem (smaller linear systems), and I was able to run this version, but with much more memory allocated than expected (i.e. 25 GB). When I try to make pmem smaller (e.g. pmem=10gb), the scheduler automatically kills my job without showing any error message from Julia itself (the second scenario).

The common problem is that Julia is overusing memory in a bizarre way, which leads Julia to kill itself in the first case, and leads the scheduler to kill the job when the over-usage is detected in the second case.
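One way to narrow this down (a diagnostic sketch, not the poster's code) is to log the peak resident set size (RSS) between the expensive steps, since that high-water mark is close to the mem figure PBS reports. A jump between two reports pinpoints the step that blows up.

```julia
# Sys.maxrss() returns the process's peak RSS in bytes.
report(step) = println(step, ": maxrss = ",
                       round(Sys.maxrss() / 2^30; digits = 2), " GiB")

report("start")
x = rand(2_000, 2_000)        # placeholder for one of the real steps
report("after allocating x")
```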

Regarding the job script, here is an example:

#!/bin/bash -l

#PBS -l nodes=1:ppn=1 
#PBS -l pmem=15gb
#PBS -l walltime=00:05:00 


module purge
module load Julia/1.3.1 
module load monitor

cd $PBS_O_WORKDIR

monitor -d 1 julia Main.jl 2 200

I do not think that user limits are the problem here. You could put this in the start of the job script to check:
ulimit -a

Also put this in your job script at the start:
free
sysctl -a | grep mem

I may well be leading everyone up a wrong path here. However, memory overcommit may be DISABLED on an HPC cluster - for good reasons.
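If sysctl is not permitted, a possible workaround (a Linux-specific sketch, assuming the usual world-readable /proc) is to read the same values from /proc directly, e.g. from within Julia:

```julia
# Overcommit policy: 0 = heuristic, 1 = always allow, 2 = strict accounting
println("vm.overcommit_memory = ",
        strip(read("/proc/sys/vm/overcommit_memory", String)))

# CommitLimit / Committed_AS show the commit headroom under strict accounting
for line in eachline("/proc/meminfo")
    startswith(line, "Commit") && println(line)
end
```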


Alright! This is what I got (alongside some permission-denied error messages in stderr; I don’t think I have the rights to run sysctl).

core file size          (blocks, -c) 62500
data seg size           (kbytes, -d) 26214400
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 770478
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 26214400
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 770478
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
              total        used        free      shared  buff/cache   available
Mem:      197734444     7752876   169395028         308    20586540   189197472
Swap:       2097148      247784     1849364
net.core.optmem_max = 20480
net.core.rmem_default = 212992
net.core.rmem_max = 67108864
net.core.wmem_default = 212992
net.core.wmem_max = 67108864
net.ipv4.igmp_max_memberships = 20
net.ipv4.tcp_mem = 4631163	6174885	9262326
net.ipv4.tcp_rmem = 4096	87380	33554432
net.ipv4.tcp_wmem = 4096	65536	33554432
net.ipv4.udp_mem = 4633461	6177950	9266922
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096
vm.lowmem_reserve_ratio = 256	256	32
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.nr_hugepages_mempolicy = 0
vm.overcommit_memory = 0

I found this topic that might be related …