Poor performance of garbage collection in multi-threaded application

Hi,
I have a tricky problem bothering me:
I am using Julia 1.5 with HTTP.jl and Mux.jl. I added multi-threading support in the manner of MusicAlbums.jl/Workers.jl at master · quinnj/MusicAlbums.jl (github.com), but compared to the single-threaded server, the garbage collector kicks in too infrequently. I have 9 worker threads and 1 main thread that accepts the requests.
The problem is that it looks like I have memory leaks: the garbage collector often does not kick in, and it takes several hours or days to clean up a significant portion of memory.

While processing a request, I download data, perform inference using an ensemble of neural networks, then update a table and return the result. When there are no incoming requests, the server uses ~30 GB of RAM, but when requests come in, it goes up to 50 GB and more.
This is a chart of 2 replicas: orange is the maximum RAM of the two, green the average.


In the image on the left we can see that after ~4 hours of memory growth the GC performs a significant cleanup (~5 GB of RAM); after another ~4 hours it cleans up another ~5 GB, but then nothing, and the memory just keeps growing.
When I had 12 worker threads, the memory grew even faster, and the GC was not cleaning it up.
How does the GC relate to multithreading?
It seems that if threads are busy doing work all the time, the GC waits, and it can wait several days before it really cleans up a larger portion of memory.
But I’m not sure if I’m interpreting it correctly.
Is there a way to force the GC to run more often?
Is there a way to suspend all threads from time to time, so the GC would run more often?
From my observations, in the 2 days after the last large cleanup only small portions of memory get cleaned up, nothing big.
I would rather have, e.g., 1 second during which the server waits and cleans up memory than this unbounded growth without significant cleanup.

1 Like

Did you try calling it manually somewhere appropriate?

GC.gc()

https://docs.julialang.org/en/v1/base/base/#Base.GC.gc

Or sprinkle

GC.safepoint()

Inserts a point in the program where garbage collection may run. This can be useful in rare cases in multi-threaded programs where some threads are allocating memory (and hence may need to run GC) but other threads are doing only simple operations (no allocation, task switches, or I/O). Calling this function periodically in non-allocating threads allows garbage collection to run.
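To illustrate (a minimal sketch, not from the original server code; `busy_worker`, the loop count, and the safepoint interval are all arbitrary placeholders), a thread doing only non-allocating work can call `GC.safepoint()` periodically so it doesn't delay collections requested by allocating threads:

```julia
# Minimal sketch: `busy_worker` does only non-allocating arithmetic, so
# without safepoints it could delay a GC requested by other threads.
const counter = Threads.Atomic{Int}(0)

function busy_worker(n::Int)
    for i in 1:n
        Threads.atomic_add!(counter, 1)    # non-allocating work
        i % 1000 == 0 && GC.safepoint()    # allow a pending collection to run here
    end
end

Threads.@threads for t in 1:Threads.nthreads()
    busy_worker(10_000)
end
```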

1 Like

Oh, perfect, I haven’t tried it, thanks!

If you are on Linux I would be interested in some stats from your application. Using 1.8-dev you can build Julia with USDT probes enabled (Instrumenting Julia with DTrace, and bpftrace · The Julia Language), and if you use julia/gc_all.bt at a0093d2ffb7ba1d35071543e581c26b96a772d39 · JuliaLang/julia · GitHub you should get a histogram of the time GC took and how much of that was stop-the-world.

3 Likes

In 1.8-dev there’s also GC.enable_logging(true), which will enable printing to stderr of some GC stats every time a garbage collection is done. It shows time spent on GC, plus amount collected. If you see far less being collected than you expect (or at a lower rate), then you might have some leftover references somewhere, keeping data alive.

3 Likes

Thanks. Do the probes also show why it hasn’t run / which generations were collected? My issue is not that we spend too much time in GC, but rather that the GC does not collect enough garbage.

I can’t easily try a new Julia because we pin some older package versions for compatibility reasons, but once I manage to get it running on Julia 1.8-dev, I’ll post results.

Maybe you could double-check whether references to unused variables are effectively being freed (i.e., whether the GC can prove the variables won’t be used anymore).

Well, no, those details aren’t included. But you might be able to compare the amount that was garbage-collected with what you would expect. The point is that if there is a reference somewhere in your code to data you think should be garbage-collected but isn’t, then it’s not the GC’s fault. So “GC not collecting enough garbage” could point to undesired behaviour (perhaps due to multi-threading, or even a bug in the GC), but also to an issue with your code keeping data alive that shouldn’t be. It would be good to figure out which of those is actually happening.

Have you tried the code in single-threaded mode with manual GC.gc() calls added? If that still doesn’t free as much as it should, then I would suspect an issue in your code keeping data referenced, rather than the GC itself.

Edit: I noticed that you mentioned single-threaded code in your original post. Can you easily compare that single-threaded run with a run using only 2 threads, which already shows a significant difference in uncollected memory?

1 Like

And how can I double-check that?

I have been having similar issues when using Distributed.jl worker processes to run many small tasks.

Although the output of each process is small and is collected into a concatenated array using pmap, as the pmap call continues more and more memory is consumed, eventually bringing Julia (and the machine) to a halt.

Even in smaller test cases where not all memory is consumed and pmap completes successfully, the machine’s memory remains occupied. If I terminate all worker processes, a large chunk of memory is freed, but even after that at least 50% of RAM is still used by the main Julia process, with nothing running (and no large variables). For reference, the machine I am using has 32 CPUs and 64 GB RAM (running Julia on Ubuntu 20.04 via a Docker container).

I have read about the trick of sprinkling GC.gc() throughout functions that run in parallel, and in some cases it alleviates the memory usage enough to make things work, but the latent memory usage persists and is impossible to free without restarting Julia. I’ve noticed it occurs with many other workloads that utilise multiple CPUs, even ones not using Distributed.jl. As an MWE of sorts, I can show it using CSV.jl, which will use all available threads to read in large files:

## start julia such that multiple threads are available, i.e. Threads.nthreads() > 1
Threads.nthreads()
using CSV, DataFrames
## memory usage in GB after starting julia and loading packages:
used_mem() = println("$(round((Sys.total_memory() - Sys.free_memory()) / 2^30 - 9))G used")
used_mem()

## create a large file for CSV.jl to read (adjust n as appropriate for your
## machine; this maxes out at about 7.5GB on mine)
n = 100000000
CSV.write("test.csv", DataFrame(repeat([(1, 1, 1)], n)))
used_mem()
## during the above, memory peaks, but running garbage collection returns it to the original state
GC.gc()
used_mem()

## now we read test.csv, using all available threads
CSV.read("test.csv", DataFrame)
## there is now a very large dataframe in ans, so memory usage is high again
used_mem()
## clear ans and collect garbage
1 + 1
GC.gc()

## if Threads.nthreads() > 1, memory usage is still very high, even with
## nothing running and no big variables to explain it
varinfo()
used_mem()


If you run the above code with multiple threads available, the final memory usage will be larger than the original usage, but if you run it with a single thread (by setting the environment variable JULIA_NUM_THREADS=1 before starting Julia), both usage values will be similar. It is also easy to see this happening with htop running alongside Julia.
Obviously on a larger scale this gets out of hand (as you can test by increasing n, if your system allows it).

TL;DR: although there might be something in user code exacerbating this, it does seem that garbage collection is not operating properly across multiple processes or threads, whether that be Distributed.jl worker processes or multi-threaded applications.

I don’t have any tool for that, but the idea is, for example, if you have a variable holding a large amount of data, make sure it is local to a function that exits when the data is no longer needed, or rebind the variable name to something else (even something like data = nothing), so that the previous buffer is freed for garbage collection.
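A minimal sketch of that pattern (the buffer size and the `process` name are illustrative, not from the original code): once the last reference is dropped, the buffer becomes eligible for collection.

```julia
# Illustrative only: drop the reference to a large buffer so the GC can reclaim it.
function process()
    data = rand(10^6)      # large temporary buffer (~8 MB)
    total = sum(data)
    data = nothing         # rebind the name; the buffer is now unreachable
    GC.gc()                # optionally force a collection right away
    return total
end
```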

I have checked the discussion in Garbage collection not aggressive enough on Slurm Cluster - Specific Domains / Julia at Scale - JuliaLang, specifically the

thr=0.02
rand(Uniform(0,1)) < thr && GC.gc();
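As a side note, rand(Uniform(0,1)) requires Distributions.jl; plain rand() from Base is equivalent here. A self-contained sketch of the same trick (the iteration count and helper name are arbitrary):

```julia
# Sketch: in a hot loop, trigger a collection with probability `thr` per iteration.
function sweep_sometimes(iters, thr)
    calls = 0
    for _ in 1:iters
        if rand() < thr
            GC.gc(false)   # incremental collection; a full GC.gc() also works
            calls += 1
        end
    end
    return calls           # on average ≈ iters * thr collections
end
```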

And I’m trying something similar now. Interestingly, even when running GC.gc() approximately every 100 s, it does not seem to perform a deep cleanup. The low part on the left is when there are no incoming requests; then memory grows under load and stays there.

I’ll also try Base.gc_bytes() to see if it gives some more information.

I will also try sprinkling in more GC.safepoint() calls.

All my code keeps large objects only in local variables, but it’s weird that even after the traffic stops, lots of memory is still allocated, although I have explicitly called GC.gc() multiple times since then.

Comparing the single-threaded code with only 2 threads seems like a good idea, I’ll take a look at that.

Julia uses a generational collector. I think if you call GC.gc(true) it does a full collection, otherwise just the recent generation; see the help text for gc.

? gc

A full collection is the default:

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps all objects, which makes the next GC scan much slower, while an incremental collection may only sweep so-called young objects.
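For example (the allocation is an arbitrary throwaway), both kinds of collection can be requested explicitly:

```julia
# Allocate some garbage, drop the reference, then collect.
v = [zeros(1_000) for _ in 1:1_000]
v = nothing
GC.gc(false)   # incremental: may sweep only young objects
GC.gc(true)    # full collection; same as GC.gc()
```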

1 Like

I’d guess that somehow it’s holding on to a bunch of objects that remain live due to some reference somewhere, possibly inside a library you are calling?

Yes, that is the most probable culprit, I’m now investigating, because I really might have references held in global state and that’s why GC is not collecting it.

2 Likes

I added ccall(:malloc_trim, Cvoid, (Cint,), 0) after the GC.gc() call, and I’m calling the GC periodically, approximately a few times per hour; with that I was able to free some memory.
Not everything has been freed, but this is definitely much better than what I observed previously, when I was not calling malloc_trim.
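For reference, a hedged sketch of that periodic cleanup as a background task (the function name and 600-second interval are arbitrary; malloc_trim is glibc-specific, so the call is guarded to Linux; the underlying C signature is int malloc_trim(size_t pad)):

```julia
# Sketch: background task that forces a full GC and then asks glibc to
# return freed pages to the OS. The interval is an arbitrary example value.
function start_periodic_cleanup(; interval = 600)
    return @async while true
        sleep(interval)
        GC.gc()                                                # full collection
        Sys.islinux() && ccall(:malloc_trim, Cint, (Csize_t,), 0)
    end
end
```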

I suppose it’s somehow related to glibc is optimized for memory allocations microbenchmarks · Issue #42566 · JuliaLang/julia (github.com), and the remaining memory might actually be held by live references, but it seems not all of the allocated memory was referenced by my code.

Putting GC.safepoint() in all threads also seemed to help a bit.

1 Like

Related to the…? :drum:

IIRC, your next stop (besides leaks) could be memory fragmentation, which might make it necessary to provision quite a bit more memory than is really used in the worst case…