Determining size of DataFrame for memory management

I’m working on some data analysis, and I don’t have as much RAM as I’d ideally like. I’m building a variety of datasets by joining several together, etc. I’d like to get a list of all objects in the global environment, filter them by type, calculate the size of each dataset, and output a table, so I can manually decide which datasets to delete partway through my analysis…

Any tips?

```
help?> varinfo
search: varinfo

  varinfo(m::Module=Main, pattern::Regex=r""; all::Bool = false, imported::Bool = false, sortby::Symbol = :name, minsize::Int = 0)
```

Useful! Is there a way to make it into a DataFrame or something so I can analyze it programmatically?
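Something like this sketch is what I have in mind, assuming `Base.summarysize` gives the deep size of each binding (the column names here are just my own):

```julia
using DataFrames

# Build a small table of every DataFrame bound in Main with its deep size,
# using Base.summarysize (bytes of everything reachable from the object).
sizes = DataFrame(name = Symbol[], rows = Int[], cols = Int[], megabytes = Float64[])
for n in names(Main)
    isdefined(Main, n) || continue
    v = getfield(Main, n)
    v isa DataFrame || continue
    push!(sizes, (name = n, rows = nrow(v), cols = ncol(v),
                  megabytes = Base.summarysize(v) / 2^20))
end
sort!(sizes, :megabytes, rev = true)   # biggest datasets first
```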

eh, don’t re-invent GC… and you likely can’t clean up variables this way.

if you need to do this kind of gymnastics you probably want to be doing more things lazily / on the fly instead of dumping datasets into an in-memory representation (DataFrame)

It’s not so much that I want to reinvent GC, I just want to release some variables if they’re too big.
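For instance, something like this is all I mean (`alldat2` is just an example name; rebinding it to `nothing` should let GC reclaim the data, as far as I understand):

```julia
alldat2 = nothing   # drop the only reference to the big DataFrame
GC.gc()             # ask the GC to reclaim it
```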

Turns out that varinfo() only accounts for a few hundred MB of variables… yet Julia is using 12 GB of RAM.

GC.gc() doesn’t help :frowning:

you need recursive=true.


you have leftover references to these big objects, so GC can’t reclaim them. If you have sample code we might be able to help
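for example, something as innocent as a view can keep the whole parent alive (a made-up illustration):

```julia
using DataFrames

big = DataFrame(x = rand(10^7))
peek = view(big, 1:10, :)   # SubDataFrame keeps a reference to big's data
big = nothing
GC.gc()                     # the 10^7 rows are still reachable through `peek`
```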

It’s just hundreds of lines of reading in CSV files as datasets, merging them, plotting things, running a few linear regressions, plotting other things… etc. etc.

It’s within a jmd file being include_weaved()

```
varinfo(sortby=:size)
...
  psamh             24.704 MiB 398515×7 DataFrame
  popdata           19.104 MiB 20596×65 DataFrame
  alldat3            4.241 MiB 3965×87 DataFrame
  alldat2            4.090 MiB 3965×80 DataFrame
  alldat             3.991 MiB 3965×76 DataFrame
  bigcount           2.554 MiB 2493×76 DataFrame
  alldat5            1.460 MiB 1292×91 DataFrame
  alldat4            1.419 MiB 1292×87 DataFrame
  whosuic            1.305 MiB 10980×34 DataFrame
  homdata          908.688 KiB 7808×13 DataFrame
  gdpstack         603.562 KiB 16492×6 DataFrame
...
```

And yet 8 Gig resident:

where do I need this?

varinfo

doesn’t change anything.

you can read it lazily, or better, write it out with Arrow.jl and use mmap to read it back.
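roughly like this (just a sketch; `psamh` is your dataset and the file name is whatever you pick):

```julia
using Arrow, DataFrames

Arrow.write("psamh.arrow", psamh)   # one-time conversion to Arrow on disk
psamh = nothing; GC.gc()            # drop the in-memory copy
psamh = DataFrame(Arrow.Table("psamh.arrow"); copycols = false)  # columns are mmapped, not loaded
```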

try to see if you can merge them “in-place” so you only keep what you absolutely need.
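e.g. with `leftjoin!` (DataFrames.jl ≥ 1.3), instead of binding alldat2, alldat3, … as ever-bigger copies — a sketch, the second table and key column are made up:

```julia
using DataFrames

# mutate the existing table rather than creating a new, bigger one each time
leftjoin!(alldat, extra; on = :country)   # `extra` and `:country` are hypothetical
```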


have you done a calculation to see if theoretically your data even fits in RAM?

It’s really not THAT big of data. The biggest dataset is 25MB according to that varinfo.

Some of the data is quite big, but I read in one file at a time, take a small subsample, and vcat them together to form that sampled psamh dataset.
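The pattern is roughly this (the directory and sampling rate are just placeholders):

```julia
using CSV, DataFrames

psamh = DataFrame()
for f in readdir("acs_files"; join = true)   # placeholder directory of ACS CSV files
    df = CSV.read(f, DataFrame)
    keep = df[rand(nrow(df)) .< 0.01, :]     # keep roughly 1% of rows
    append!(psamh, keep)                     # assumes all files share the same columns
    df = nothing                             # let the full file be collected
end
GC.gc()
```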

what about

```
varinfo(;all = true, sortby=:size, recursive=true)
```

same kind of thing:

```
  psamh             24.704 MiB 398515×7 DataFrame
  popdata           19.104 MiB 20596×65 DataFrame
  alldat3            4.241 MiB 3965×87 DataFrame
  alldat2            4.090 MiB 3965×80 DataFrame
  alldat             3.991 MiB 3965×76 DataFrame
  bigcount           2.554 MiB 2493×76 DataFrame
  alldat5            1.460 MiB 1292×91 DataFrame
  alldat4            1.419 MiB 1292×87 DataFrame
  whosuic            1.305 MiB 10980×34 DataFrame
  homdata          908.688 KiB 7808×13 DataFrame
  gdpstack         603.562 KiB 16492×6 DataFrame
```

so is this actually a problem then? are you running into OOM? there’s no reason to prematurely free up memory (from the OS’s point of view) if your system memory usage % is low

Yeah, systemd-oomd is killing codium and all its sub-programs (i.e. Julia). At the moment Julia is using 54% of my available RAM, and swap is up to about 4 GB out of 9.

can you share script and data?

Yeah, actually it’s on GitHub, international.jmd over here:

it’ll download some big datasets including the ACS microdata for the last 5 years …


does running them outside of the jmd help?

good question. I guess I can “tangle()” them and then run that script?
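i.e. something like this, assuming tangle writes international.jl next to the source:

```julia
using Weave

tangle("international.jmd")   # extract just the code chunks into a plain .jl script
include("international.jl")   # run it outside of Weave
```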

also maybe not inside codium… will have to try that out tomorrow probably.