Determining size of DataFrame for memory management

can you share script and data?

Yeah, actually it’s on GitHub, international.jmd, over here:

it’ll download some big datasets including the ACS microdata for the last 5 years …


does running them not inside jmd help?

good question. I guess I can “tangle()” them and then run that script?

also maybe not inside codium… will have to try that out tomorrow probably.

right, idk how VSCode.jl works; I’d hope that if it keeps any references, they’re still visible from somewhere

Ok, I guess I’ll try tomorrow.

btw it runs weave() just fine, which seems to launch a separate julia process… so it may well be due to running inside vscode/codium

actually, I take that back… running weave just now and it is now up to 15GB and swap is near 100%


On my phone now so hard to dig out, but issues with these types of unreclaimable memory have been reported intermittently over the years by CSV/Dataframes users.

I don’t think it’s ever been narrowed down enough with an MWE for anyone to investigate this properly, i.e. work out whether there’s a real issue somewhere or whether it really is just people keeping references around somewhere that they forgot about. If you can turn what you have into a small self contained script that loads in data, bins all references, and then exhibits a large memory footprint after GC that would be really helpful.
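The pattern being asked for here might look something like the sketch below: a self-contained script with a synthetic CSV standing in for the ACS files (the file contents and sizes are illustrative assumptions, not the real data), which reads, bins every reference, forces GC, and leaves the process footprint to be inspected.

```julia
# Hypothetical repro sketch: synthetic CSV in place of the real dataset.
using CSV, DataFrames

path, io = mktemp()
# ~1M rows of made-up data standing in for the ACS microdata
CSV.write(io, DataFrame(a = rand(10^6), b = rand(1:100, 10^6)))
close(io)

df = CSV.read(path, DataFrame)   # the large read, as in the real script
df = nothing                     # bin the only reference
GC.gc(); GC.gc()                 # collect twice so finalizers also run
# At this point, check the process RSS in a system monitor:
# if it stays high with no live references, that's the footprint to report.
rm(path)
```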

I’ll see what I can do. It’s clearly not me keeping refs, because varinfo says I’ve got less than ~100 MB, but perhaps DataFrames is keeping refs from its parsing efforts or something?

I’m wondering if it isn’t related to VS Code keeping refs to hundreds of plot objects though… scatter plotting 100k data points hundreds of times, etc.?

I think I will go line by line running the code in VS Code and see whether I can see a pattern in memory usage. Thanks for your help; I will get back to this thread this afternoon with anything I can figure out.

Not sure how feasible this is for you as I haven’t looked at the code, but I would maybe try going the other way and try to put the code into a script that can be run in the REPL without any plotting, i.e. just reading in and wrangling the data, computing whatever it is you need, and then dereferencing everything. If that keeps the memory around we might be onto a real issue with the packages used which needs further investigation.

Good idea. Weave can strip the code out and then I can just include it and see where we are at the end of that. Will do that this morning and see what happens.

@nilshg and @jling
Ok, it was a good plan. I guess it’s not too surprising the memory balloons to 10GB after reading and subsampling the 5 year Census ACS data… here is a MWE:

using Pkg

using CSV, DataFrames, Downloads, DataFramesMeta, StatsPlots, XLSX,
    Dates, Statistics, Turing, LinearAlgebra, Interpolations, Serialization,
    GLM, Colors, ColorSchemes, StatsBase   # StatsBase for wsample

function getifnotthere(filename, URL)
    if !Base.Filesystem.ispath(filename)
        Downloads.download(URL, filename)
    end
end

if !Base.Filesystem.ispath("data/psam_husa.csv") || !Base.Filesystem.ispath("data/psam_husb.csv")
    # ... download the ACS files via getifnotthere ...
end

psamh = let psamh = DataFrame()
    for i in ["a", "b", "c", "d"]
        new = @select(CSV.read("data/psam_hus$(i).csv", DataFrame), :SERIALNO, :ST, :FINCP, :NP, :WGTP)
        @subset!(new, .!ismissing.(:FINCP) .&& .!ismissing.(:NP))
        psamh = [psamh;
            new[wsample(1:nrow(new), new.WGTP, round(Int64, 0.1 * nrow(new))), :]]
    end
    psamh
end
psamh.Year = tryparse.(Int64, [psamh.SERIALNO[i][1:4] for i in 1:nrow(psamh)])

what if you change this to

new = CSV.read("data/psam_hus$(i).csv", DataFrame; select = [:SERIALNO, :ST, :FINCP, :NP, :WGTP])

I think you mean select = [...]. I will try that out; it may help as a workaround. But my code is still valid and should not result in high memory usage after it finishes running.


Does help quite a bit on allocations:

julia> @time df = CSV.read(["psam_hus$(i).csv" for i ∈ 'a':'d'], DataFrame);
 81.034727 seconds (6.26 M allocations: 15.158 GiB, 0.84% gc time, 2.18% compilation time)

julia> @time df = CSV.read(["psam_hus$(i).csv" for i ∈ 'a':'d'], DataFrame; select = [:SERIALNO, :ST, :FINCP, :NP, :WGTP]);
 57.979814 seconds (1.61 M allocations: 303.496 MiB, 0.63% compilation time)

(note this reads in all and concatenates all four files)
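The vector-of-files form of CSV.read used above can be checked on tiny throwaway files (the toy file names and contents here are illustrative, not the psam_hus data):

```julia
# CSV.read with a vector of paths stacks the files into one DataFrame.
using CSV, DataFrames

paths = map(1:2) do i
    p = tempname() * ".csv"
    CSV.write(p, DataFrame(x = fill(i, 3)))   # 3 rows per toy file
    p
end

df = CSV.read(paths, DataFrame)   # both files concatenated
nrow(df)                          # 6 rows: 3 from each file
foreach(rm, paths)                # clean up the temp files
```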

That’s great, a useful workaround. But why does the original code result in un-freeable memory?


I’ll have a look to see what’s going on. I struggle to generate reliable behaviour here: the first time I ran my code above, my system monitor showed 10+GB being used, whereas when I run it now I see basically no increase whatsoever. If I then do

julia> dropmissing!(df, [:FINCP, :NP]);

julia> df = df[unique(rand(1:nrow(df), round(Int, 0.11*nrow(df)))), :];

julia> GC.gc()

I can free around 300MB, which I guess is expected.

idk, but I would never make “dump everything into the DataFrame” the first operation, because it basically pushes peak memory to the maximum possible

If what you want to do is well known, then subselecting columns makes sense. If you start out wanting to explore what’s in the dataset… sometimes you just want to see what’s in there.

But yeah, point taken!
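For the exploratory case, CSV.jl’s `limit` keyword offers a middle ground: it shows every column while materializing only a handful of rows. A sketch on a toy file (the file and column names are made up for illustration):

```julia
# Peek at a file's columns without paying for a full read.
using CSV, DataFrames

p = tempname() * ".csv"
CSV.write(p, DataFrame(a = 1:10_000, b = rand(10_000)))

peek = CSV.read(p, DataFrame; limit = 5)   # only the first 5 rows
names(peek)   # all column names are still visible
nrow(peek)    # 5
rm(p)
```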


Ok, using select reduces the intermediate allocations and leaves me with only about a 1700 MB RAM footprint after running the script, yet varinfo() still shows less than about 100 MB allocated. So I still think there’s a major memory leak somewhere.
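One way to make the varinfo-vs-system-monitor gap concrete (a hedged sketch, not a diagnosis): varinfo() only sums objects bound in Main, while the OS counts everything the allocator holds on to. Two Base calls report each side:

```julia
# Julia's view vs. the OS's view of memory after a collection.
GC.gc()
println("GC live bytes: ", Base.gc_live_bytes() ÷ 2^20, " MiB")  # what the GC thinks is live
println("peak RSS:      ", Sys.maxrss() ÷ 2^20, " MiB")          # what the OS has seen the process use
```

If live bytes are small while RSS stays large, the memory is held below the level varinfo() can see.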


Is there somewhere I should file this as a bug? @nilshg ?

Ok, it looks like it’s probably related to this, so I reported it here:

which is also linked from: Memory issue when repeatedly creating large DataFrames · Issue #2902 · JuliaData/DataFrames.jl · GitHub