Data set size in DataFrames with Vector{Union{T, Missing}}

mwsohn · March 31, 2018, 2:59am

I have recently moved to DataFrames 0.11 on Julia 0.6.2 and used StatFiles.jl with FileIO.jl to convert Stata datasets to DataFrames. I was astounded by the memory used by DataFrames. It seems to me that DataFrames requires about six times the amount of memory that Stata uses to store the data. For example, I have a data set of about 611MB. I converted it to DataFrames as follows:

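using DataFrames, FileIO, StatFiles, JLD2
df = DataFrame(load("brfss1314.dta"))  # read the Stata file into a DataFrame
gc()                                   # Julia 0.6 syntax; GC.gc() from 0.7 on
Base.summarysize(df)                   # in-memory size of df, in bytes
@save "brfss1314.jld2" df              # write df to a JLD2 file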

Base.summarysize reported 3.35GB, and the saved JLD2 file was over 13GB and still growing when I stopped the save after more than 30 minutes.

As you may know, StatFiles.jl converts Stata data to the corresponding Julia types, i.e., “byte” to “Int8”, “int” to “Int16”, etc. So the converted data types in DataFrames are not to blame for the large size. Is there any way that we can make DataFrames use memory more efficiently? Or is the current state of affairs the future of DataFrames?

I already miss DataArrays a lot.

davidanthoff · March 31, 2018, 3:13am

Arrays of Union{T,Missing} are not efficiently stored on julia 0.6, so that might explain it. That problem should go away on julia 0.7, which has a more efficient memory layout story for those types of arrays.
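
To make the layout point concrete, here is a minimal sketch, assuming Julia 0.7 or later (it will not run on 0.6, which lacks the `Vector{T}(undef, n)` constructor). It compares the reported size of a plain Int8 column with the same column typed to allow missing values. The names `n`, `plain` and `with_missing` are purely illustrative, and I have left out expected numbers because the exact overhead depends on the Julia version.

```julia
# Sketch only: assumes Julia 0.7 or later; all names are illustrative.
n = 1_000_000

# A column of Int8 that cannot hold missing values.
plain = zeros(Int8, n)

# The same data in a column that may also hold `missing`.
with_missing = Vector{Union{Int8, Missing}}(undef, n)
fill!(with_missing, Int8(0))
with_missing[1] = missing

# Reported in-memory size of each column, in bytes.
plain_bytes = Base.summarysize(plain)
missing_bytes = Base.summarysize(with_missing)

println("Vector{Int8}:                 ", plain_bytes, " bytes")
println("Vector{Union{Int8, Missing}}: ", missing_bytes, " bytes")
println("ratio: ", missing_bytes / plain_bytes)
```

On those versions a small Union like this should be stored inline with one extra type-tag byte per element, so I would expect the ratio to be close to 2 for Int8 and lower for wider element types. That is far from the blow-up reported above, but it is worth checking on your own installation.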

One short-term solution would be for DataFrames.jl to add DataArrays.jl back into its REQUIRE file. At that point I could actually use DataArrays in the iterable tables constructor story that underlies the StatFiles.jl design. @nalimilan is that something you would consider? There wouldn’t even have to be a using DataArrays or import DataArrays statement in DataFrames.jl; it would literally be enough to add that one line in REQUIRE. And it could be a short-term thing, i.e. just until julia 0.7 comes out.

Another short-term (or maybe even long-term) option is to use IndexedTables.jl, which has an efficient memory layout for columns with missing data on julia 0.6 (it uses DataValueVector from DataValues.jl).

Of course, all of these ideas only make sense if the memory layout is actually the culprit here. Seems likely to me, but I’m not 100% sure.

nalimilan · March 31, 2018, 9:58am

Yes, that’s most probably a temporary issue which is fixed on Julia 0.7. Using DataArray is the best workaround for now.

@davidanthoff Why would DataFrames need to require DataArrays? Shouldn’t the dependency be added to StatFiles instead, since that’s the package which would actually construct DataArray objects?

mwsohn · April 1, 2018, 9:12pm

Thank you. I will be more patient now that I know that it will be better with Julia 0.7.
