Determining size of DataFrame for memory management

Right, I don’t know how VSCode.jl works; I’d hope that if it keeps any references they’re still visible from somewhere.

Ok, I guess I’ll try tomorrow.

By the way, it runs weave() just fine, which seems to launch a separate Julia process… so the problem may well be due to running inside VS Code/Codium.

Actually, I take that back… running weave() just now, it is up to 15 GB and swap is near 100%.


I’m on my phone now so it’s hard to dig things out, but issues with this kind of unreclaimable memory have been reported intermittently over the years by CSV/DataFrames users.

I don’t think it’s ever been narrowed down enough with an MWE for anyone to investigate this properly, i.e. work out whether there’s a real issue somewhere or whether it really is just people keeping references around somewhere that they forgot about. If you can turn what you have into a small, self-contained script that loads the data, bins all references, and then exhibits a large memory footprint after GC, that would be really helpful.
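Something along these lines is what I have in mind (a rough sketch; the file name is just a placeholder):

# sketch of a self-contained reproducer (placeholder file name)
using CSV, DataFrames

df = CSV.read("some_large_file.csv", DataFrame)   # load the data
# ... any wrangling ...
df = nothing      # bin all references
GC.gc()           # force a full collection
# then check the process footprint in the OS (top/htop); if it is still
# large while nothing is referenced, that points at a real issue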

I’ll see what I can do. It’s clearly not me keeping refs, because varinfo() says I’ve got less than ~100 MB, but perhaps DataFrames is keeping refs from its parsing efforts or something?
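For reference, the check I’m doing is just varinfo() from InteractiveUtils (loaded by default in the REPL), roughly:

julia> using InteractiveUtils   # already available at the REPL

julia> varinfo()                # list variables in Main with approximate sizes

julia> varinfo(Main, r"psamh")  # or restrict to names matching a pattern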

I’m wondering if it isn’t related to VS Code keeping refs to hundreds of plot objects, though… scatter plotting 100k data points hundreds of times, etc.?

I think I will go line by line running the code in VS Code and see whether I can spot a pattern in memory usage. Thanks for your help; I will get back to this thread this afternoon with anything I can figure out.

Not sure how feasible this is for you as I haven’t looked at the code, but I would try going the other way and put the code into a script that can be run in the REPL without any plotting, i.e. just reading in and wrangling the data, computing whatever you need, and then dereferencing everything. If that keeps the memory around, we might be onto a real issue with the packages used which needs further investigation.
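At the end of such a script, something like this (just a sketch) would help distinguish what Julia’s GC still considers live from what the process holds overall:

# sketch: compare GC-live bytes with the overall process footprint
df = nothing                          # drop the last reference to the data
GC.gc()                               # full collection
@show Base.gc_live_bytes() / 2^20     # MiB of objects the GC still tracks as live
@show Sys.maxrss() / 2^20             # peak resident set size of the process, in MiB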

Good idea. Weave can strip the code out and then I can just include it and see where we are at the end of that. Will do that this morning and see what happens.
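If I remember right, Weave.tangle does the stripping, something like this (the file name is just a placeholder for my document):

using Weave
tangle("analysis.jmd")    # writes analysis.jl containing only the code chunks
include("analysis.jl")    # run the extracted code without the document machinery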

@nilshg and @jling
OK, it was a good plan. I guess it’s not too surprising that the memory balloons to 10 GB after reading and subsampling the 5-year Census ACS data… here is an MWE:

using Pkg
Pkg.activate(".")

using CSV, DataFrames, Downloads, DataFramesMeta, StatsPlots, StatsBase, XLSX,
    Dates, Statistics, Turing, LinearAlgebra, Interpolations, Serialization,
    GLM, Colors, ColorSchemes   # StatsBase provides wsample, used below



# download a file only if it isn't already present locally
function getifnotthere(filename, URL)
    if !Base.Filesystem.ispath(filename)
        Downloads.download(URL, filename)
    end
end


getifnotthere("data/pums-2020-5yr-hus.zip","https://www2.census.gov/programs-surveys/acs/data/pums/2020/5-Year/csv_hus.zip")


# unzip the PUMS housing files if they haven't been extracted yet
if !Base.Filesystem.ispath("data/psam_husa.csv") || !Base.Filesystem.ispath("data/psam_husb.csv")
    cd("data")
    run(`unzip pums-2020-5yr-hus.zip`)
    cd("..")
end

psamh = let psamh = DataFrame()
    for i in ["a", "b", "c", "d"]
        # keep only the columns of interest from each housing-unit file
        new = @select(CSV.read("data/psam_hus$(i).csv", DataFrame), :SERIALNO, :ST, :FINCP, :NP, :WGTP)
        # drop rows with missing family income or household size
        @subset!(new, .!ismissing.(:FINCP) .&& .!ismissing.(:NP))
        # append a ~10% subsample, weighted by the household weights WGTP
        psamh = [psamh;
            new[wsample(1:nrow(new), new.WGTP, round(Int64, 0.1 * nrow(new))), :]]
    end
    psamh
end
# the first four characters of SERIALNO encode the survey year
psamh.Year = tryparse.(Int64, psamh.SERIALNO[i][1:4] for i in 1:nrow(psamh))


What if you change this to:

new = CSV.read("data/psam_hus$(i).csv", DataFrame; select = [:SERIALNO,:ST,:FINCP,:NP,:WGTP])

I think you mean select = [...]. I will try that out; it may help as a workaround. But my code is still valid and should not result in high memory usage after it has finished running.


It does help quite a bit on allocations:

julia> @time df = CSV.read(["psam_hus$(i).csv" for i ∈ 'a':'d'], DataFrame);
 81.034727 seconds (6.26 M allocations: 15.158 GiB, 0.84% gc time, 2.18% compilation time)

julia> @time df = CSV.read(["psam_hus$(i).csv" for i ∈ 'a':'d'], DataFrame; select = [:SERIALNO, :ST, :FINCP, :NP, :WGTP]);
 57.979814 seconds (1.61 M allocations: 303.496 MiB, 0.63% compilation time)

(note this reads in and concatenates all four files)

That’s great, a useful workaround. But why does the original code result in un-freeable memory?


I’ll have a look to see what’s going on. I struggle to generate reliable behaviour here: the first time I ran my code above, my system monitor showed 10+ GB being used, whereas when I run it now I see basically no increase whatsoever. If I then do

julia> dropmissing!(df, [:FINCP, :NP]);

julia> df = df[unique(rand(1:nrow(df), round(Int, 0.11*nrow(df)))), :];

julia> GC.gc()

I can free around 300MB, which I guess is expected.

I don’t know, but I would never make “dump everything into the DataFrame” the first operation, because it basically pushes the peak memory usage to the maximum possible.

If what you want to do is well known, then subselecting columns makes sense. If you start out wanting to explore what’s in the dataset… sometimes you just want to see what’s in there.

But yeah, point taken!
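For the exploring case, I guess one option (just a sketch; the row limit is arbitrary) is to peek at the first rows before committing to a full read:

julia> preview = CSV.read("data/psam_husa.csv", DataFrame; limit = 1_000);   # only the first 1000 rows

julia> names(preview)   # see which columns are available before reading everything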


OK, using select reduces the intermediate allocations and leaves me with only about a 1700 MB RAM footprint after running the script, yet varinfo() still shows less than about 100 MB allocated, so I still think there’s a major memory leak somewhere.
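Before calling it a leak, one thing that may be worth ruling out: on Linux, glibc’s malloc often keeps freed pages rather than returning them to the OS, so the process footprint can stay high even though the GC reports almost nothing live. A rough check (Linux/glibc only; just a sketch):

julia> GC.gc()   # collect first

julia> ccall(:malloc_trim, Cint, (Csize_t,), 0)   # ask glibc to hand freed pages back to the OS

If the resident size drops after that, the memory was reclaimable all along and simply hadn’t been handed back.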


Is there somewhere I should file this as a bug? @nilshg ?

OK, it looks like it’s probably related to this, so I reported it here:

https://github.com/JuliaLang/julia/issues/42566

which is also linked from: Memory issue when repeatedly creating large DataFrames · Issue #2902 · JuliaData/DataFrames.jl · GitHub