Garbage collector behaviour when memory is almost full

rssdev10 · August 29, 2019, 3:39pm

I have a script for calculation of millions of strings distances with StringDistances.jl. I’m using 64 CPUs virtual machine with 240GB RAM and found that my script was killed by out of memory. I cleared the script to minimize object allocations but Julia still takes up to 240GB. Memory consumption in time is far from linear but rather a saw. Looks like garbage collector is switched on after few minutes but next is completely switched off. Also, there is dependency on how many threads I’m running. When the threads number is much less than number of available CPUs it works. E.g. 30 from 64. But when I’m running the script with 60 threads on 64 CPUs VM, I’m getting out of memory after some time. The only way I found how to finish the calculation properly is to add explicit memory control:

@threads for item in list
  # do something useful with `item`
  # In my case this part is calculated few minutes
  # ...

  if (Sys.free_memory() / Sys.total_memory() < 0.1)
    GC.gc()
    sleep(10)
  end
end

And running with JULIA_NUM_THREADS=60 julia --project=@. src/...

After ~10 hours of calculation with 60 threads I got results. And I can say that my real script’s memory consumption is less that 5-10 GB but not 240 GB of available RAM.

I found similar issue https://github.com/JuliaLang/julia/issues/6103 but looks it is still actual.

Julia 1.2, CentOS 7

So, the questions are how to avoid that explicit code in my script and does Julia do automatic cleaning of memory instead of collecting garbage and be killed by operational system by out of memory?

jling · August 29, 2019, 5:23pm

while julia for sure should handle this, I just want to ask some simple question since no MWE is provided, can you pre-allocate? if all 60 threads are all comparing the biggest stings, what’s the estimated memory consumption?

jpsamaroo · August 29, 2019, 6:30pm

I noticed the following issue posted to Julia’s issue tracker about GC under multithreaded situations: https://github.com/JuliaLang/julia/issues/33097

Essentially, because the GC pass requires all threads to hit a safepoint (and then pause), it’s possible that one or more “runaway” threads keep the GC pass from occurring, and so those threads can keep accumulating garbage that isn’t GC’d before hitting the memory limit. The PR linked in that issue (https://github.com/JuliaLang/julia/pull/33092) provides a means to explicitly do a cheap safepoint check in the middle of compute-bound code. Maybe it’s worth giving that PR a try?

rssdev10 · August 29, 2019, 7:19pm

@jling, yes, I’m pre-allocating and even pre-processing strings before doing distance calculations. Actually the task is elementary. Simply speaking I have two lists of strings with ~10-30 characters. In the mentioned case with 10h x 60 threads it was 20k + 70k strings. And I’m using RatcliffObershelp() metrics for distance calculation. Even sorting of tokens I’m doing once for these stings. 60 threads are going through 20k list.

@jpsamaroo, thanks. It looks like I used old style but almost proper way how to fix it temporarily The task where I found the issue is a single run task. So, definitely I can check it with mentioned by you safepoint. At the same time I hope it will be fixed soon.

lmiq · August 15, 2020, 5:35pm

I had the same type of problem with a parallel computation of mine. The default garbage collection of Julia was not being safe enough when many threads were being launched.

This workaround solved the problem completely, but why did you add the sleep(10) directive?

rssdev10 · August 15, 2020, 8:34pm

The only reason why I added sleep() is to release a current processor time slice scheduled by an operational system. 10 is less than a typical time slice. In that case, I’m expecting that a thread with a garbage collector will be able to do something useful.

danrib07 · June 23, 2021, 8:06pm

Hi,

I think this thread can be helpful on an issue I’m working on with segfaults in julia. However I did not quite understand the purpose of sleep(10). Could you elaborate on it a little more?

Thank you!!

rssdev10 · June 24, 2021, 4:59am

When a thread is doing sleep, it means that it is not participating in the tasks scheduling. 10 milliseconds is less than minimal time slice in most operational systems but it is enough to switch a processor core to another thread. The garbage collection issue is happening when a system is high loaded. So, excluding the current computational thread from the cycle of task scheduling increases probability of finishing utility work by auxiliary threads. Including the background garbage collection.

Topic		Replies	Views
Poor performance of garbage collection in multi-threaded application Julia at Scale garbage-collection	22	5438	February 3, 2022
Garbage collection not aggressive enough on Slurm Cluster Julia at Scale multithreading , garbage-collection	17	2911	August 4, 2021
Calling for people using multithreading heavily, memory fragmentation problem General Usage multithreading , memory , memory-allocation , garbage-collection	1	1112	October 3, 2020
Weird behaviour of GC with multithreaded array-access General Usage multithreading , garbage-collection	2	195	January 31, 2025
Triggerring GC unexpectedly New to Julia	9	386	September 5, 2022

Garbage collector behaviour when memory is almost full

Related topics