GC going nuts

I have a workload where I process larger datasets (a few GB) iteratively. After adjusting some parameters, the processing is repeated within the same REPL session. After two or three rounds everything slows down, and with GC logging switched on one sees that more than 90% of the runtime is consumed by repeated full GCs, which almost never free much memory. This happens while the total size of the process stays at about 10 GB (at least that is what the Task Manager shows) on a 32 GB Windows machine. The typical behavior is that the first processing round triggers only a few incremental collections, on the second round I get a high rate (about 8/s) of incremental collections, and on the third round it switches to full collections. Larger amounts of memory are freed on only one or two of these; most of the time they are completely useless.
Below is, for example, the GC logging output and the @timev result of a processing step which normally (on the first run) takes less than a second. The roughly 1 GB it allocates is mostly output data which will be used later, so there is not much to free. And more than 10 GB of free RAM are available according to the Windows Task Manager.

GC: pause 474.99ms. collected 107.624664MB. full 
GC: pause 470.20ms. collected 0.003744MB. full 
GC: pause 463.51ms. collected 0.496024MB. full 
GC: pause 507.51ms. collected 0.016096MB. full 
GC: pause 489.35ms. collected 173.481960MB. full 
GC: pause 479.97ms. collected 0.000000MB. full 
GC: pause 493.98ms. collected 0.025552MB. full 
GC: pause 486.61ms. collected 0.000000MB. full 
GC: pause 469.64ms. collected 0.259736MB. full 
GC: pause 470.56ms. collected 0.000000MB. full 
GC: pause 470.62ms. collected 52.879934MB. full 
  6.088668 seconds (542 allocations: 1.046 GiB, 86.67% gc time)
elapsed time (ns):  6088667800
gc time (ns):       5276923600
bytes allocated:    1123004200
pool allocs:        470
non-pool GC allocs: 31
malloc() calls:     41
free() calls:       26
minor collections:  0
full collections:   11

This is on 1.9.0-beta3 for Windows, but I remember that on 1.8 it triggered even more easily.

My solution up to now is to disable the GC during processing, switch it back on afterwards, and force a collection. But this is of course a bit risky and hacky…
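For reference, a minimal sketch of that workaround (the helper name `with_gc_disabled` is my own, not an existing API), wrapped in try/finally so the GC is re-enabled even if the processing step throws:

```julia
# Run f() with the GC disabled, then re-enable and force a full collection.
# Risky: memory can grow without bound while the GC is off.
function with_gc_disabled(f)
    GC.enable(false)
    try
        return f()
    finally
        GC.enable(true)
        GC.gc()   # one full collection after the batch, instead of many during it
    end
end

result = with_gc_disabled() do
    sum(1:1_000_000)   # stand-in for the real processing step
end
```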

Is there some knob to tune how often a GC (especially a full one) is triggered?

Maybe this would be helpful?

help?> GC.enable
  GC.enable(on::Bool)

  Control whether garbage collection is enabled using a boolean argument (true for enabled, false for disabled). Return previous GC state.

  │ Warning
  │
  │  Disabling garbage collection should be used only with caution, as it can cause memory use to grow without bound.

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps all objects, which makes the next GC scan much slower, while an incremental
  collection may only sweep so-called young objects.

  │ Warning
  │
  │  Excessive use will likely lead to poor performance.

Is this within a VM or some kind of virtualization? Julia can detect the memory available to its VM if it’s running in one, so if that amount is low enough compared to what the hardware allows, I think it could cause the issues you experience.

Another thing that could help, I believe, is making your allocated data structures more GC-friendly. This would probably mean less pointer indirection.
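To illustrate what "less pointer indirection" means in practice: an array whose element type is a bits type is a single flat allocation with no interior pointers, while an array of boxed values gives the GC one reference per element to trace.

```julia
# A Vector{Any} stores boxed references the GC must trace element by element;
# a Vector{Float64} is one flat buffer the GC never has to scan inside.
boxed = Any[rand() for _ in 1:10]   # 10 heap boxes + the vector itself
flat  = [rand() for _ in 1:10]      # Vector{Float64}, isbits elements

@show isbitstype(eltype(flat))    # true  → no interior pointers
@show isbitstype(eltype(boxed))   # false → each element is a traced reference
```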

That’s what I’m doing currently. But it’s rather tedious to sprinkle the code with GC on/off sections, and I was looking for a better solution. Most probably this has to be regarded as a sort of performance bug: spending 90% of the time in GC while there is nothing to collect seems counterproductive.

It’s directly on a Windows machine, no VM involved. And the structures, especially the large arrays, are top-level in a larger mutable struct, so there is no deep nesting involved.

Are you creating lots of Strings? They are known to be tough on the GC, which is why InlineStrings and friends were created.

No, only large 2-dimensional arrays of size 1000×40000 and a few vectors of size 1000 or 40000, respectively. No heavy creation of small objects.

If you are able to put together a somewhat standalone reproducer, I think a Julia issue would be nice, since it could be used as a benchmark for further improvements to the GC.


For my setup (Win10, 32 GB RAM) the following generic code triggers the high-frequency loop of full collections after being executed two or three times in one session. It happens for a smaller dataset in the VSCode REPL than in a standalone terminal session (see the comment in the code). After playing a bit with different sizes and settings, I’ve got the impression that the eager full-GC mode is entered when the working set size of the Julia process reaches half of the available real RAM (not counting swap). But that may be just coincidental…

#import some typical packages to populate the memory map with small objects  
using Dates 
using Logging
using Printf
using Images
import Gtk
import Cairo
using LsqFit
import Plots

mutable struct BigData
    maindata::Matrix{Float32} 
    auxdata1::Matrix{Float32}
    auxdata2::Matrix{Float32}
    auxdata3::Matrix{Float32}
    flags::Matrix{UInt16}
    time::Vector{Float64}
    vscale::Vector{Float64}
end

function process_data(m, n)
    return BigData(zeros(Float32, m, n), zeros(Float32, m, n), 
                   zeros(Float32, m, n), zeros(Float32, m, n),
                   zeros(UInt16, m, n), zeros(n), zeros(m))
end

GC.enable_logging(true)

const m = 1000
const n = 40000
const ndatasets = 20 # this is for the VSCode REPL. Set it to 30 on a plain terminal

#do some warmup
results = []
for i = 1:ndatasets
    push!(results, process_data(m, n))
end

results = []
@timev for i = 1:ndatasets
    push!(results, process_data(m, n))
end
nothing
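A quick back-of-the-envelope check of what one `BigData` instance from the repro above occupies (using the same m = 1000, n = 40000):

```julia
# Size of one BigData: four Float32 matrices, one UInt16 flags matrix,
# and two small Float64 vectors.
bytes_per_dataset = let m = 1000, n = 40_000
    4 * m * n * sizeof(Float32) +   # maindata, auxdata1..3
    m * n * sizeof(UInt16) +        # flags
    n * sizeof(Float64) +           # time
    m * sizeof(Float64)             # vscale
end
println(round(bytes_per_dataset / 2^30, digits = 2), " GiB per dataset")
```

So each dataset is about 0.67 GiB, and 20 of them come to roughly 13.4 GiB — and since `results` is rebound to `[]` between the warm-up and the timed loop, the first batch becomes garbage while the second is being allocated.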

cc @vchuravy — new GC benchmark just dropped


I looked around in the Julia source code and think I now have an idea of what is going on:
In gc.c the variable max_total_memory controls the super-eager GC mode. It is set on startup to 0.7 times the memory available at that time and never changed afterwards. This has the following effects:

  • While this makes sense as a general rule, especially on low-RAM machines, it is far from optimal if you invested in a lot of RAM specifically to run large Julia jobs.
  • Since max_total_memory is only set on startup and never adjusted, it does not help to close other memory-consuming applications to give an already running Julia process more memory.
  • It explains why the threshold for the high-rate full-GC mode seems not very reproducible at first look, since it depends on the free memory (or whatever uv_get_available_memory() returns) at the exact moment the process starts.
  • For my example above, it explains why the Julia process gets stuck at a working set size of about half of the RAM.

Indeed, looking at the libuv docs, I think it would be better to use uv_get_constrained_memory than uv_get_available_memory, especially now that it’s possible to set additional memory limits on the command line with --heap-size-hint.

On my machine I get for the various uv_get_XXX_mem functions:

total mem = 34161102848
constrained mem = 0
avail. mem = 27067404288
free mem = 27067404288

So this makes no difference. It’s the 70% limit which renders about 8 GB of my RAM unusable to Julia.
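For anyone who wants to check the same values without writing a ccall: on Julia 1.8+ the `Sys` module wraps these libuv queries (`Sys.total_memory` returns the constrained amount when a limit is set, otherwise the physical total):

```julia
# What Julia sees on this machine, via the Base.Sys wrappers around libuv.
println("total physical memory: ", Sys.total_physical_memory())
println("free physical memory:  ", Sys.free_physical_memory())
println("total memory (constraint-aware): ", Sys.total_memory())
println("free memory  (constraint-aware): ", Sys.free_memory())
```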

It seems there are two separate issues here, from what you found out:

  1. Julia bases its heap size limits on the memory that is available when it starts, instead of on the total (possibly constrained) memory, which seems somewhat arbitrary and unnecessary given that we now have --heap-size-hint.
  2. Julia limits memory usage to a very conservative 70%. I think this should be tunable (if it’s not already), because the value that makes the most sense will depend on the application.

Yes, but both combine in an unfortunate way. At the moment, as far as I can see, the --heap-size-hint is capped at 70% of available memory. So as it stands, you can use the command line parameter only to lower the limit.


So, what could be possible ways forward here? What I have now implemented in my local copy of Julia is to be more permissive than 70% of available memory if the user specified a heap size via --heap-size-hint. In the code below it is then limited to the total memory. Together with the typical swap file sizes this should still be relatively safe. A user setting --heap-size-hint is expected to be experienced enough to know what they are doing…
Here is a patch to do this, as a starting point for discussion:

From 42541dd3538b664ac2b4268f4ae2f3e016c4d890 Mon Sep 17 00:00:00 2001
From: Martin Wirth <martin.wirth@dlr.de>
Date: Thu, 26 Jan 2023 09:46:25 +0100
Subject: [PATCH] Allow a user specified --heap-size-hint to go beyond 70% of
 available memory

---
 src/gc.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/gc.c b/src/gc.c
index 17ad622900..1ba686a219 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -652,6 +652,7 @@ static const size_t default_collect_interval = 5600 * 1024 * sizeof(void*);
 static const size_t max_collect_interval = 1250000000UL;
 static size_t total_mem;
 // We expose this to the user/ci as jl_gc_set_max_memory
+static const memsize_t default_max_total_memory = (memsize_t) 2 * 1024 * 1024 * 1024 * 1024 * 1024;
 static memsize_t max_total_memory = (memsize_t) 2 * 1024 * 1024 * 1024 * 1024 * 1024;
 #else
 typedef uint32_t memsize_t;
@@ -660,6 +661,7 @@ static const size_t max_collect_interval =  500000000UL;
 // Work really hard to stay within 2GB
 // Alternative is to risk running out of address space
 // on 32 bit architectures.
+static const memsize_t default_max_total_memory = (memsize_t) 2 * 1024 * 1024 * 1024;
 static memsize_t max_total_memory = (memsize_t) 2 * 1024 * 1024 * 1024;
 #endif

@@ -3683,10 +3685,15 @@ void jl_gc_init(void)
     if (constrained_mem > 0 && constrained_mem < total_mem)
         total_mem = constrained_mem;
 #endif
-
+    uint64_t high_water_mark;
+    if (max_total_memory == default_max_total_memory) {
     // We allocate with abandon until we get close to the free memory on the machine.
-    uint64_t free_mem = uv_get_available_memory();
-    uint64_t high_water_mark = free_mem / 10 * 7;  // 70% high water mark
+       uint64_t free_mem = uv_get_available_memory();
+       high_water_mark = free_mem / 10 * 7;  // 70% high water mark
+    } else {
+    // Get closer to the limits, if the user specified a bound by setting --heap-size-hint
+       high_water_mark = uv_get_total_memory();
+    }

     if (high_water_mark < max_total_memory)
        max_total_memory = high_water_mark;
--
2.38.1

Another solution could be to introduce a second GC tuning parameter to set the 70% limit more freely.
And a further way to mitigate the problem would be to rework the low-memory behavior of the GC. Running full GCs at a high rate even when nearly no memory is freed is quite a harsh reaction, as it gets very close to a livelock. A smoother fallback would be better. But it’s far beyond my knowledge of the GC internals to make suggestions here!
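As a stopgap without rebuilding Julia: the gc.c comment in the patch above notes that max_total_memory is exposed as jl_gc_set_max_memory, so the limit can also be raised at runtime with a ccall. This is an unexported internal whose behavior may change between versions, and the 24 GiB figure below is just an example value for a 32 GB machine:

```julia
# Raise the GC's max_total_memory at runtime via the internal C entry point.
# Unexported and version-dependent; use at your own risk.
new_limit = UInt64(24) * 1024^3   # e.g. 24 GiB on a 32 GB machine
ccall(:jl_gc_set_max_memory, Cvoid, (UInt64,), new_limit)
```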
