Memory blow-up when passing DataFrame to function inside @threads loop

#1

When I run the code below, the memory usage blows up:

using DataFrames, RData  # assumed packages: RData provides load() for .rds files

function func(df::DataFrame)
    X = df[:time]
    indices = findall(X .> 0)
end

# read in R data
rds = "blablab.rds"
objs = load(rds);

params = collect(0.5:0.005:0.7);

for i in 1:length(objs)
    cols = [string(name) for name in names(objs.data[i]) if occursin("blabla",string(name))]
    hypers = [(a,b) for a in cols, b in params]

    results = [DataFrame() for _ in 1:length(hypers)]

    # HERE IS WHERE THE MEMORY BLOWS UP
    Threads.@threads for hi in 1:length(hypers)
        name, val = hypers[hi]
        results[hi] = func(objs.data[i])
    end
end

The DataFrame passed to func() is about 0.7 GB. When I run this code, memory usage climbs to ~30 GB! It seems like just accessing a column of df inside func() is copying the whole thing?
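One way to check the copying suspicion is to measure how much a single call to the body of func() allocates and compare that with the size of the column. The sketch below is only a rough check under stated assumptions: it uses a random plain vector as a stand-in for the `time` column (the real data comes from the .rds file), and `positive_indices` is a hypothetical helper name, not part of the original code. Note that `X .> 0` builds a temporary Boolean mask and `findall` returns a new index vector, so some allocation per call is expected even when nothing is copied.

```julia
# Stand-in for the `time` column (assumption: the real column is a numeric vector).
X = rand(1_000_000) .- 0.5

# Same work as the body of func(), applied to a plain vector.
# `positive_indices` is a hypothetical name used only for this sketch.
positive_indices(x) = findall(x .> 0)

positive_indices(X)                     # warm-up call so compilation is not measured
bytes = @allocated positive_indices(X)  # bytes allocated by one call

println("column size:    ", sizeof(X), " bytes")
println("allocated/call: ", bytes, " bytes")
```

If the allocation per call is on the order of the column size or smaller, the function body itself is probably not what copies the 0.7 GB DataFrame, and the growth is more likely tied to how memory is reclaimed inside the threaded loop.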


#2

It looks like it could be similar to the issue that I ran into here:

If it is the same issue, I believe it was fixed on the Julia master branch a few months ago, but the fix hasn't made it into a release yet. Again, if it is the same problem, things should work with Julia 1.0.3 and earlier. I've been sticking with that release until we get a new bug-fix release.
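To see which side of that version boundary a session is on, something like the following minimal sketch can be run first. It only reports the Julia version and thread count; the 1.0.3 cutoff is taken from the statement above and applies only if this really is the same issue.

```julia
# Report the running Julia version and the number of threads available to @threads.
println("Julia version: ", VERSION)
println("Threads:       ", Threads.nthreads())

# Cutoff taken from the reply above (assumption: it is the same issue).
newer_than_103 = VERSION > v"1.0.3"
println(newer_than_103 ?
        "Newer than 1.0.3: may be affected if this is the same issue." :
        "1.0.3 or earlier: expected to be unaffected if this is the same issue.")
```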
