Why DataFrames v.0.21.2 (julia v1.4.2) requires more memory than the previous version

tamasgal · June 15, 2020, 9:40pm

If it’s only 0s and 1s, which might end up being interpreted as Int64, then it’s no wonder that the size in memory blows up.

Such a row in your CSV looks like this (if I understood correctly):

0,1,0,1,1,0,...

That means you have ~2 bytes per value in the file (one digit plus one delimiter). An Int64 has a size of 8 bytes, so the parsed data will occupy about 4x as much space in RAM as the file does on disk.
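
To make that estimate concrete, here is a rough sketch of the arithmetic in Julia. The value count `n` is a made-up figure for illustration, not the size of the actual file.

```julia
# Rough estimate: size of a 0/1 CSV as text vs. the same values held as Int64.
# `n` is a hypothetical number of values, chosen only for illustration.
n = 1_000_000

bytes_per_value_text  = 2               # one digit plus one delimiter (comma or newline)
bytes_per_value_int64 = sizeof(Int64)   # size of one Int64 element in bytes

csv_bytes   = n * bytes_per_value_text
int64_bytes = n * bytes_per_value_int64
ratio       = int64_bytes / csv_bytes

println("as text: ", csv_bytes, " bytes; as Int64: ", int64_bytes, " bytes; ratio: ", ratio)
```

This ignores per-column and per-object overhead, so it is only a lower bound on what the parsed table needs.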

I’d recommend using some kind of in-place data conversion to Bool or UInt8. You need to help CSV/DataFrames and tell them the exact types. They cannot guess it unless they read the whole data, but that’s already too late…

freeman · June 16, 2020, 7:07pm

How about BitArray?

https://web.mit.edu/julia_v0.6.0/julia/share/doc/julia/html/en/stdlib/arrays.html#BitArrays-1
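
As a rough illustration of what a BitArray would buy, the sketch below stores the same 0/1 values under three element types and compares their footprints using only Base Julia. The element count `n` and the random data are assumptions for the example, not the data from this thread.

```julia
# Compare the memory footprint of the same 0/1 data under three element types.
# `n` is a hypothetical element count used only for illustration.
n = 1_000_000

ints  = rand(Int64[0, 1], n)    # Vector{Int64}: 8 bytes per element
bools = Vector{Bool}(ints)      # Vector{Bool}: 1 byte per element
bits  = BitVector(bools)        # BitVector: 1 bit per element, packed into UInt64 chunks

for (name, v) in (("Vector{Int64}", ints), ("Vector{Bool}", bools), ("BitVector", bits))
    println(rpad(name, 16), Base.summarysize(v), " bytes")
end
```

Going from Int64 to Bool already cuts the footprint to about an eighth, and a BitVector to about an eighth of that again, at the cost of bit-level access on every read and write.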

quinnj · June 29, 2020, 2:45am

Note that CSV.jl’s memory footprint has been fixed in the latest 0.7 release.

One idea others have mentioned is reading the data in as Bool, which you could do by passing truestrings=["1"], falsestrings=["0"], in case you wanted to go that route.
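
A minimal sketch of that route, assuming a recent CSV.jl in which `types` accepts a single type for all columns (older releases may need a per-column vector or dict instead). The in-memory `data` buffer is a hypothetical stand-in for the real file, and `header = false` assumes the file has no header row.

```julia
using CSV, DataFrames

# Hypothetical stand-in for the original file: no header, only 0s and 1s.
data = IOBuffer("0,1,0,1\n1,1,0,0\n")

df = CSV.read(data, DataFrame;
              header = false,          # assumption: the file has no header row
              types = Bool,            # assumption: single-type form of `types`
              truestrings = ["1"],     # parse "1" as true
              falsestrings = ["0"])    # parse "0" as false

println(eltype.(eachcol(df)))
```

With Bool columns each value takes 1 byte in memory instead of the 8 bytes of an Int64.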
