Why DataFrames v0.21.2 (Julia v1.4.2) requires more memory than the previous version

If the data is only 0s and 1s, and those end up being parsed as Int64, then it’s no wonder that the size in memory blows up.

Such a row in your CSV looks like this (if I understood correctly):

0,1,0,1,1,0,...

That means you have ~2 bytes per value on disk (one digit plus one separator). An Int64 takes 8 bytes, so the data will occupy roughly 4x as much space in RAM as it does in the file.
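To make the arithmetic concrete, here is a rough sketch using only Base Julia. The value count `n` is a made-up number for illustration, and the 2-bytes-per-value figure assumes single-digit values each followed by a comma or newline.

```julia
# Rough on-disk vs in-memory estimate for n single-digit values.
# Assumption: each value costs 1 digit + 1 separator = 2 bytes in the CSV.
n = 1_000_000                      # hypothetical number of values

csv_bytes   = 2 * n
int64_bytes = sizeof(Int64) * n    # 8 bytes per value
uint8_bytes = sizeof(UInt8) * n    # 1 byte per value
bool_bytes  = sizeof(Bool) * n     # 1 byte per value

println("Int64 / CSV ratio: ", int64_bytes / csv_bytes)   # 4.0
println("UInt8 / CSV ratio: ", uint8_bytes / csv_bytes)   # 0.5
println("Bool  / CSV ratio: ", bool_bytes / csv_bytes)    # 0.5
```

So with a 1-byte element type the in-memory columns would come out at about half the file size rather than four times it.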

I’d recommend some kind of in-place conversion of the data to Bool or UInt8. You need to help CSV/DataFrames by telling them the exact types. They cannot guess the types without reading the whole file, and by then it’s already too late…
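As a small illustration of what the narrower types buy you, here is a Base-only sketch that converts an already parsed Int64 column. The vector `col` is made-up sample data. Note that this conversion allocates a new vector while the Int64 one still exists, so it does not lower peak memory, which is why giving the parser the types up front is the better route.

```julia
col = Int64[0, 1, 0, 1, 1, 0]   # a column as parsed with the default integer type

as_bool  = Bool.(col)           # Vector{Bool}; throws InexactError for values other than 0 or 1
as_uint8 = UInt8.(col)          # Vector{UInt8}

# sizeof on an Array reports the bytes used by its elements
println(sizeof(col))            # 48
println(sizeof(as_bool))        # 6
println(sizeof(as_uint8))       # 6
```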

How about a BitArray? It packs each Bool into a single bit, so it needs about 8x less memory than a Vector{Bool}.

https://web.mit.edu/julia_v0.6.0/julia/share/doc/julia/html/en/stdlib/arrays.html#BitArrays-1
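A quick sketch of the BitArray idea, using only Base Julia. The sample `row` and the element count `n` are made up for illustration, and `Base.summarysize` includes some container overhead, so treat the printed numbers as approximate.

```julia
row  = "0,1,0,1,1,0"               # one line in the format shown above
bits = split(row, ',') .== "1"     # a broadcast comparison returns a BitVector (BitArray{1})
println(bits)

# Compare the footprint of three ways to hold n zero/one values.
n      = 1_000_000
ints   = zeros(Int64, n)           # 8 bytes per value
bools  = zeros(Bool, n)            # 1 byte per value
packed = falses(n)                 # BitVector, 1 bit per value

for v in (ints, bools, packed)
    println(typeof(v), ": ", Base.summarysize(v), " bytes")
end
```

The trade-off is that indexing into a BitArray involves bit masking, so element access can be somewhat slower than with a Vector{Bool}.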

Note that CSV.jl’s memory footprint has been fixed in the latest 0.7 release.

One idea others have mentioned is reading the data in as Bool, which you could do by passing truestrings=["1"] and falsestrings=["0"], in case you want to go that route.
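A minimal sketch of that route, assuming CSV.jl and DataFrames.jl are installed. The temporary file, its contents, and `ncols` are made up for illustration. The `types` argument is an addition to the suggestion above: as far as I know, columns containing only 0 and 1 may otherwise still be detected as Int64, and the exact keyword behaviour may differ between CSV.jl versions.

```julia
using CSV, DataFrames

# Tiny headerless sample file standing in for the real data.
path = tempname()
write(path, "0,1,0,1\n1,1,0,0\n")

ncols = 4   # assumed to be known up front for this sketch

file = CSV.File(path;
                header = false,              # the rows are pure data
                types = fill(Bool, ncols),   # request Bool for every column
                truestrings = ["1"],
                falsestrings = ["0"])

df = DataFrame(file)
println(eltype.(eachcol(df)))
```

From here the columns could be packed further into BitVectors if 1 byte per value is still too much.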
