I recently moved to DataFrames 0.11 on Julia 0.6.2 and used StatFiles.jl with FileIO.jl to convert Stata datasets to DataFrames. I was astounded by the memory usage: DataFrames seems to require about six times the memory that Stata uses to store the same data. For example, I have a dataset of about 611 MB, which I converted as follows:
```julia
using DataFrames, FileIO, StatFiles, JLD2
df = DataFrame(load("brfss1314.dta"))
gc()
Base.summarysize(df)
@save "brfss1314.jld2" df
```
`Base.summarysize` reported 3.35 GB, and the saved JLD2 file was over 13 GB and still growing when I stopped it after more than 30 minutes.
As you may know, StatFiles.jl maps each Stata storage type to the corresponding Julia type, e.g., "byte" to `Int8`, "int" to `Int16`, etc., so the converted column types themselves are not to blame for the large size. Is there any way to make DataFrames use memory more efficiently? Or is the current state of affairs the future of DataFrames?
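For what it's worth, here is a minimal, package-free sketch of one plausible culprit I have been looking at: if the loaded columns end up with a boxed element type (e.g., `Any` or a wrapper type) rather than a concrete one like `Int8`, each value is stored behind a pointer and memory use inflates several-fold. The variable names and sizes below are just illustrative, not from my actual dataset:

```julia
# Illustration: element type matters enormously for memory.
# A Vector{Int8} stores one byte per element; a Vector{Any}
# holding the same values stores a pointer per element plus
# boxed objects, so Base.summarysize reports a much larger size.
n = 1_000_000
tight = rand(Int8, n)        # concrete element type: ~1 MB
boxed = Vector{Any}(tight)   # boxed elements: several times larger
println(Base.summarysize(tight))
println(Base.summarysize(boxed))
```

So it might be worth checking `eltype` and `Base.summarysize` column by column on the loaded DataFrame to see whether the blowup is concentrated in a few badly-typed columns.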
I already miss DataArrays a lot.