Help with optimizing GC time with large objects in memory

jthomasnull · November 8, 2018, 10:50pm

I need to keep a large object (a DataFrame of about 20 GB) in memory, and as a result I’m seeing very large GC times. An example:

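using Random  # shuffle lives in Random since Julia 0.7

function t1(indexes::Vector{Int64}, ranges::Vector{UnitRange{Int64}})::Vector{Vector{Int64}}
    # Copy one slice of `indexes` per range (10 slices of 1,000,000 Int64 ≈ 76.3 MiB)
    map(x -> indexes[x], ranges)
end

n1 = 1000000;
indexes = shuffle(collect(1:n1*10));
ranges = map(x -> ((x-1)*n1+1):(x*n1), [1:10;]);
@time index_subsets = t1(indexes, ranges);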

If I run this code with my objects in memory, I get:
julia> @time index_subsets = t1(indexes,ranges);
9.114982 seconds (27 allocations: 76.295 MiB, 99.87% gc time)

If I run it in a largely empty REPL I get:
julia> @time index_subsets = t1(indexes,ranges);
0.069398 seconds (27 allocations: 76.295 MiB, 62.67% gc time)

Same allocations, but GC time is more than 100x longer. I know that a GC call scans the heap, but I’m not sure what I can do to optimize things. Is there a guide or documentation on how to minimize GC time with large objects in memory?
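
One way to narrow this down, sketched under assumptions: time a forced full collection on its own, and compare the copying version with one that builds views instead of copies. The names `full_gc_seconds` and `index_views` are made up for illustration, and whether views remove the slowdown in the loaded session has not been tested here. They only avoid the 76 MiB of fresh allocation that triggers the collection.

```julia
using Random

n1 = 1_000_000
indexes = shuffle(collect(1:n1*10))
ranges = [((x - 1) * n1 + 1):(x * n1) for x in 1:10]

# Cost of one forced full collection. Comparing this number between the
# empty session and the session holding the big DataFrame shows how much
# of the slowdown comes from the heap contents alone.
full_gc_seconds = @elapsed GC.gc()

# Views reference the existing `indexes` buffer instead of copying it,
# so this allocates only ten small wrapper objects.
index_views = map(r -> view(indexes, r), ranges)
```

Each element of `index_views` behaves like a read/write window into `indexes`, so this only fits if the subsets do not need to be independent copies.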

y4lu · November 9, 2018, 12:12am

Friendly reminder to ``` quote ``` your code

I’m not sure, but it could be the ‘benchmarking in global scope’ gotcha.
I usually seem to get away with it somehow; maybe it doesn’t bite when the objects are small enough?

Would storing the big DataFrame in a separate module-space global keep the GC from scanning it?

The recommended solution is to wrap the benchmark code in a function, e.g.

using Random
fx(n) = reshape(randperm(n * 10), 10, :)  # gives a similar 10×1000000 matrix
@time index_subset = fx(1000000)
> 0.73 sec

jthomasnull · November 9, 2018, 7:49pm

Thanks for the response; I will quote my code in the future. It’s definitely not the benchmarking-in-global-scope gotcha: it kicks in any time anything allocates. I will experiment with putting my big data in a separate module space. I hadn’t heard that that mattered before. Is there some reference I could look at about how the GC treats module-space objects differently from those in the global REPL scope? My workflow depends on lots of interactive exploration of the large DataFrame, so it’s mildly awkward to put it in a module, but if it solves this problem it’s worth it.

y4lu · November 10, 2018, 3:32am

Ctrl + F “module” in julia/gc.c at master (JuliaLang/julia on GitHub), but it’s probable the module-space trick still won’t work from the global REPL (maybe in the future). There’s a chance it will if the other code is also in another separate module, like [module runcode] <- [main] -> [module datastore], so the DataFrame is not in a sub-branch.
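
A minimal sketch of that layout, only to make the idea concrete. The module names `DataStore` and `RunCode` and their functions are hypothetical, a plain vector stands in for the DataFrame, and nothing here confirms that the arrangement changes what the collector scans.

```julia
# Hypothetical layout: [RunCode] <- [Main] -> [DataStore]

module DataStore
# A const Ref keeps the binding's type fixed while the contents stay replaceable.
const DATA = Ref{Vector{Int}}(Int[])
store!(x::Vector{Int}) = (DATA[] = x; nothing)
load() = DATA[]
end

module RunCode
# The working code lives in its own module and receives the data as an argument.
subsets(indexes, ranges) = map(r -> indexes[r], ranges)
end

DataStore.store!(collect(1:100))
result = RunCode.subsets(DataStore.load(), [1:10, 11:20])
```

From the REPL the data is still reachable interactively through `DataStore.load()`, so exploration stays possible.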

Memory mapped files could be a better option
added: How many columns?
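
For the memory-mapped route, a rough sketch using the `Mmap` standard library. The file path, the length `n` and the single `Int64` column are assumptions for illustration. The mapped data is backed by the file rather than allocated on the Julia heap, but whether that cuts GC time for a 20 GB DataFrame is untested here.

```julia
using Mmap

path = tempname()   # hypothetical location for the column file
n = 1_000           # hypothetical column length

# Write the column once; "w+" creates the file and mmap grows it to fit.
open(path, "w+") do io
    A = Mmap.mmap(io, Vector{Int64}, n)
    A .= 1:n
    Mmap.sync!(A)   # flush the mapped pages to disk
end

# Later (or in another session), map the same file and use it like a Vector.
io = open(path, "r")
col = Mmap.mmap(io, Vector{Int64}, n)
close(io)           # the mapping remains usable after the stream is closed
total = sum(col)
```

This works directly for bits-type columns such as `Int64` or `Float64`; string columns would need a different on-disk representation.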
