Hello,
I am using an HPC cluster to run code that involves solving two huge sparse linear systems. I was able to test the code on my Windows machine with 16 GB of memory, and I used the @time macro to monitor the memory allocations at the critical places (matrix assembly, matrix multiplication, system solving, etc.):
0.573989 seconds (143.03 k allocations: 710.180 MiB, 22.39% gc time)
0.568460 seconds (71.21 k allocations: 1.865 GiB, 0.75% gc time)
9.861626 seconds (4.70 M allocations: 1.018 GiB)
0.177318 seconds (275.76 k allocations: 120.454 MiB, 19.67% gc time)
0.220135 seconds (82.13 k allocations: 39.772 MiB)
0.005179 seconds (123 allocations: 8.219 KiB)
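For reference, here is a minimal, self-contained sketch of how I take these timings; the sizes and matrices below are stand-ins, not my actual problem:

```julia
using SparseArrays, LinearAlgebra

# Stand-in problem, only to show where the @time calls above sit:
n = 10_000
A = sprand(n, n, 1e-3) + I       # placeholder for the assembled sparse system matrix
b = rand(n)

@time B = A * A                  # sparse matrix multiplication step
@time x = B \ b                  # sparse solve (UMFPACK LU for a general sparse matrix)
```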
I was also able to reduce the number of allocations by using dropzeros!(..) and other features of sparse matrices. In general, the code consumed up to 7 GB of memory on Windows. However, when I moved to the cluster, I requested 10 GB for the same code and got an OutOfMemory error. I increased the request to 20 GB and it also crashed! It only works when I request 25 GB. I made a plot of the memory consumption in kB, as measured by a tool on our cluster, and got this graph.
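For context, a toy sketch of the stored-zeros situation that dropzeros! cleans up (not my actual matrices): explicitly stored zeros still cost memory and show up in nnz until they are dropped.

```julia
using SparseArrays

A = sparse([1, 2], [1, 2], [1.0, 2.0], 3, 3)
A[1, 1] = 0.0      # overwriting a stored entry with zero keeps it stored explicitly
nnz(A)             # 2
dropzeros!(A)      # remove explicitly stored zeros in place
nnz(A)             # 1
```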
This may not be your issue, but note that if you’re running with addprocs() on the same node, you will be copying all data to each process. This means that, if you have an 8-core node, and you run addprocs(7) on it, you will have 8 separate copies of your data as each process will get its own.
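A toy illustration of that effect (with made-up sizes), in case it is relevant:

```julia
using Distributed
addprocs(7)                           # 7 extra worker processes on the same node
@everywhere using SparseArrays        # each process loads its own copy of the package

A = sprand(10_000, 10_000, 1e-3)      # built once on the master process
# Shipping A to a worker serializes a separate copy of it into that process,
# so peak memory grows roughly with the number of processes that receive it:
counts = [remotecall_fetch(nnz, w, A) for w in workers()]
```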
Without knowing more about how you set up your cluster jobs and your multiprocessing environment, it’s hard to say what the problem is.
Good questions from @traktofon
Also ask your systems managers if the job is being run within a cgroup or container.
It looks like it is, since you are requesting a specific amount of memory.
I don't really know exactly, as I don't have access to this data in real time. All I have is the total vmem and mem usage reported in the stdout file. One example from a job that crashed is the following:
Really, really sorry to ask this.
Can you cut and paste the Out of Memory error? I think it is the PBS mechanism that is terminating the job - the job here is not being run within a cgroup, and it is the PBS daemon that watches memory use and kills the job.
You can get a detailed log output from the PBS job log on the first compute node, but you probably have to be a root user to do this.
But I must say this does not really help us with your Julia issue of more memory use on a Linux cluster as opposed to Windows.
Can you share your PBS job submission script?
Also on the compute nodes how many cores are there and how much memory?
Do you know if hyperthreading is enabled? As far as I know, PBS is not aware of hyperthreading.
I could be wrong!
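Not a substitute for the PBS-side answers, but a quick sketch of something you could run in a Julia session on a compute node to see what the process itself reports:

```julia
# Logical CPUs include hyperthreads if they are enabled on the node:
@show Sys.CPU_THREADS
@show Threads.nthreads()                 # threads this Julia session was started with
@show Sys.total_memory() / 2^30          # memory reported to the process, in GiB
@show Sys.free_memory() / 2^30
```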
I forgot to clarify the distinction between two scenarios. In the first, when I (think I) requested sufficient memory based on what I tested on Windows, I get the OutOfMemory error as:
[ Info: started timer at: 2020-04-01T15:14:48.981
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] Array at ./boot.jl:404 [inlined]
[2] spmatmul(::SparseMatrixCSC{Float64,Int64}, ::SparseMatrixCSC{Float64,Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/SparseArrays/src/linalg.jl:212
[3] *(::SparseMatrixCSC{Float64,Int64}, ::SparseMatrixCSC{Float64,Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/SparseArrays/src/linalg.jl:187
[4] LSS() at /ddn1/vol1/site_scratch/../../../..//1500/LSS.jl:130
[5] top-level scope at /ddn1/vol1/site_scratch//../../../../1500/Main.jl:80
[6] include at ./boot.jl:328 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1105
[8] include(::Module, ::String) at ./Base.jl:31
[9] exec_options(::Base.JLOptions) at ./client.jl:287
[10] _start() at ./client.jl:460
in expression starting at /ddn1/vol1/site_scratch/../../../../1500/Main.jl:80
I scaled down my problem (smaller linear systems), and I was able to run this version, but only with much more memory requested than expected (i.e. 25 GB). When I make pmem smaller (e.g. pmem=10gb), the scheduler automatically kills my job without any error message from Julia itself (2nd scenario).
The common problem is that Julia is using far more memory than expected, which makes Julia itself abort with OutOfMemoryError in the first case, and makes the scheduler kill the job when the memory over-use is detected in the 2nd case.
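In case it helps to compare the two environments, here is a minimal, self-contained sketch (stand-in sizes, and a hypothetical memreport helper, not my actual code) of how I could log resident memory around the multiplication step on both Windows and the cluster:

```julia
using SparseArrays

# Hypothetical helper to log resident memory at each critical step:
memreport(tag) = @info tag maxrss_GiB = Sys.maxrss() / 2^30 free_GiB = Sys.free_memory() / 2^30

A = sprand(20_000, 20_000, 1e-4)   # stand-in matrices, not the real problem
B = sprand(20_000, 20_000, 1e-4)

memreport("before multiplication")
C = A * B                          # the step that raises OutOfMemoryError in the trace above
memreport("after multiplication")
C = nothing; GC.gc()               # drop the reference and force a collection
memreport("after GC")
```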