Help with optimizing GC time with large objects in memory


#1

I need to keep a large object (a DataFrame of around 20GB) in memory, and as a result I’m seeing very long GC times. An example:

using Random

function t1(indexes::Vector{Int64},ranges::Vector{UnitRange{Int64}})::Vector{Vector{Int64}}
    map(x->indexes[x],ranges)
end

n1 = 1000000;
indexes = shuffle(collect(1:n1*10));
ranges = map(x->((x-1)*n1+1):x*n1,[1:10;]);
@time index_subsets = t1(indexes,ranges);

If I run this code with my objects in memory, I get:
julia> @time index_subsets = t1(indexes,ranges);
9.114982 seconds (27 allocations: 76.295 MiB, 99.87% gc time)

If I run it in a largely empty REPL I get:
julia> @time index_subsets = t1(indexes,ranges);
0.069398 seconds (27 allocations: 76.295 MiB, 62.67% gc time)

Same allocations, but the GC time is roughly 200x longer (about 9.1 s versus 0.04 s). I know that a GC call scans the heap, but I’m not sure what I can do to optimize things. Is there a guide or documentation on how to minimize GC time with large objects in memory?
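
In case it helps anyone reproduce this, here is a rough sketch of how the GC cost can be isolated from the rest of the call. It only uses `GC.gc` and `@timed` from Base; the variable names are just for illustration, and the numbers will depend entirely on what else is live in the session.

```julia
# Time a forced full collection and a forced incremental one.
# With a large live heap, the full collection is the one expected to be slow.
full_gc_seconds = @elapsed GC.gc(true)
incremental_gc_seconds = @elapsed GC.gc(false)

# @timed reports the GC share of a single call separately from its total time.
stats = @timed collect(1:10^6)

println("full GC: ", full_gc_seconds, " s")
println("incremental GC: ", incremental_gc_seconds, " s")
println("allocating call: ", stats.time, " s total, ", stats.gctime, " s in GC")
```

Running this once in the empty REPL and once with the big object loaded should show whether the extra time comes from full collections.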


#2

Friendly reminder to ``` quote ``` your code

I’m not sure, but it could be the ‘benchmarking in global scope’ gotcha.
I usually seem to get away with it somehow; maybe it doesn’t bite when the objects are small enough?

Would storing the big DataFrame in a separate module-space global stop the GC from scanning it?


The recommended solution is to wrap the benchmark code in a function, e.g.

using Random
fx(n) = reshape(randperm(n * 10), 10, :)  # gives a similar 10x1000000 matrix
@time index_subset = fx(1000000)
> 0.73 sec

#3

Thanks for the response; I will quote my code in the future. It’s definitely not the benchmarking-in-global-scope gotcha; it kicks in any time anything allocates. I will experiment with putting my big data in a separate module space. I hadn’t heard that that mattered before. Is there a reference I could look at on how the GC treats module-space objects differently from those in the global REPL scope? My workflow depends on lots of interactive exploration of the large DataFrame, so it’s mildly awkward to put it in a module, but if it solves this problem it’s worth it.


#4

Ctrl + F “module” in https://github.com/JuliaLang/julia/blob/master/src/gc.c, but it’s probable that the module-space trick still won’t work from the global REPL (maybe in the future). There’s a chance it will if the other code is also in another separate module, like [module runcode] <- [main] -> [module datastore], so the DataFrame is not in a sub-branch.
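
To make that layout concrete, here is a minimal sketch. The module and function names (`DataStore`, `RunCode`, `store!`, `fetch_data`, `subsets`) are made up for illustration, and a small vector stands in for the DataFrame. I have not verified that this changes what the GC scans; it only shows the arrangement.

```julia
# [module RunCode] <- [Main] -> [module DataStore]

module DataStore
# Holds the large object behind a const Ref so Main needs no global of its own.
const STORE = Ref{Any}(nothing)
store!(x) = (STORE[] = x; nothing)
fetch_data() = STORE[]
end

module RunCode
# Receives the data as an argument, so this module never refers to DataStore.
subsets(indexes, ranges) = map(r -> indexes[r], ranges)
end

# Main only wires the two modules together.
DataStore.store!(collect(1:100))   # stand-in for the 20GB DataFrame
ranges = [1:10, 11:20]
index_subsets = RunCode.subsets(DataStore.fetch_data(), ranges)
```

Interactive work would then go through `DataStore.fetch_data()` instead of a global in Main.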

Memory mapped files could be a better option
added: How many columns?
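
A minimal sketch of the memory-mapped idea, using the `Mmap` standard library on a single numeric column. The temporary file and the tiny size are placeholders, and this assumes the columns are plain bits types such as `Int64` or `Float64`. As far as I understand, the mapped array is backed by the file rather than by memory allocated on Julia's heap, but whether that reduces GC time in your case would need measuring.

```julia
using Mmap

path = tempname()        # placeholder scratch file
n = 1_000                # placeholder column length

# Write one column to disk as raw Int64 values.
open(path, "w") do io
    write(io, collect(Int64, 1:n))
end

# Map the file as a Vector{Int64}; the OS pages data in on access.
col = Mmap.mmap(path, Vector{Int64}, n)
total = sum(col)
```

Each column could be mapped this way and then used like an ordinary vector.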