Data set size in DataFrames with Vector{Union{T, Missing}}


#1

I have recently moved to DataFrames 0.11 on Julia 0.6.2 and used StatFiles.jl with FileIO.jl to convert Stata datasets to DataFrames. I was astounded by the memory used by DataFrames. It seems to me that DataFrames requires about six times the amount of memory that Stata uses to store the same data. For example, I have a data set of about 611MB. I converted it to a DataFrame as follows:

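using DataFrames, FileIO, StatFiles, JLD2
df = DataFrame(load("brfss1314.dta"))  # read the Stata file into a DataFrame
gc()                                   # collect garbage before measuring (Julia 0.6 syntax)
Base.summarysize(df)                   # in-memory size of df, in bytes
@save "brfss1314.jld2" df              # write df to a JLD2 file

Base.summarysize reported 3.35GB, and the JLD2 file had grown past 13GB and was still growing when I stopped the save after more than 30 minutes.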

As you may know, StatFiles.jl converts Stata data to the corresponding Julia types, i.e., “byte” to “Int8”, “int” to “Int16”, etc., so the converted element types in DataFrames are not to blame for the large size. Is there any way that we can make DataFrames use memory more efficiently? Or is the current state of affairs the future of DataFrames?
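One way to check where the memory goes is to measure each column separately. The following is only a sketch: it uses a small, made-up NamedTuple of vectors (`columns`, with invented column names) as a stand-in for the real table, and it assumes Julia 0.7 or later, where `Missing` and named tuples are available in Base.

```julia
# Hypothetical stand-in for a loaded table: a NamedTuple of column vectors.
# The column names and values are invented for illustration.
columns = (
    age    = Int8[34, 51, 27, 45],
    weight = Union{Float32, Missing}[70.5f0, missing, 82.0f0, 64.3f0],
    state  = Int16[1, 2, 2, 5],
)

# Collect (name, element type, size in bytes) for every column.
report = [(name, eltype(col), Base.summarysize(col)) for (name, col) in pairs(columns)]

for (name, T, bytes) in report
    println(rpad(string(name), 8), rpad(string(T), 28), bytes, " bytes")
end
```

Columns whose size is far larger than `length(col) * sizeof(T)` would be the ones to look at more closely.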

I already miss DataArrays a lot.


#2

Arrays of Union{T,Missing} are not efficiently stored on julia 0.6, so that might explain it. That problem should go away on julia 0.7, which has a more efficient memory layout story for those types of arrays.
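A rough way to see the difference, as a sketch only: compare the size of a plain vector with the size of the same data stored as a vector that allows missing values. This assumes Julia 0.7 or later, where `Missing` is in Base (on 0.6 it comes from Missings.jl); the variable names are just illustrative.

```julia
# Compare the memory footprint of a plain vector and a vector that may hold missing values.
n = 1_000_000
plain = rand(Int8, n)                               # Vector{Int8}
with_missing = Vector{Union{Int8, Missing}}(plain)  # same values, but missing is allowed
with_missing[1] = missing

println("Vector{Int8}:                 ", Base.summarysize(plain), " bytes")
println("Vector{Union{Int8, Missing}}: ", Base.summarysize(with_missing), " bytes")
```

On 0.7 and later the second number should stay within a small multiple of the first, since small unions of bits types are stored inline with a per-element type tag. A much larger ratio on 0.6 would support the memory layout explanation.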

One short-term solution would be for DataFrames.jl to add DataArrays.jl back into its REQUIRE file. At that point I could actually use DataArrays in the iterable tables constructor story that underlies the StatFiles.jl design. @nalimilan is that something you would consider? There wouldn’t even have to be a using DataArrays or import DataArrays statement in DataFrames.jl; it would literally be enough to add that one line in REQUIRE. And it could be a short-term thing, i.e. just until julia 0.7 comes out.

Another short term (or maybe even long term) option is to use IndexedTables.jl, which has an efficient memory layout for columns with missing data on julia 0.6 (it uses DataValueVector from DataValues.jl).

Of course, all of these ideas only make sense if the memory layout is actually the culprit here. Seems likely to me, but I’m not 100% sure.


#3

Yes, that’s most probably a temporary issue which is fixed on Julia 0.7. Using DataArray is the best workaround for now.

@davidanthoff Why would DataFrames need to require DataArrays? Shouldn’t the dependency be added to StatFiles instead, since that’s the package which would actually construct DataArray objects?


#4

Thank you. I will be more patient now that I know that it will be better with Julia 0.7.