I have been struggling with this for a while. When I convert an Array of NamedTuple to a DataFrame, I see excessive memory allocations. Consider the following stripped-down example:
using DataFrames
using BenchmarkTools
A = fill((a=1, b=2.0), 10000000)
julia> sizeof(A) / 1000000
160.0
julia> @btime DataFrame($A);
1.275 s (30000024 allocations: 915.53 MiB)
The memory allocations exceed the actual size of the data several times over. With my real data, which contains ~30 columns of mixed Ints and Floats, the difference is even more dramatic: I’m getting 2 GB of allocations on a 50 MB array. This is still feasible, but soon I’ll have to scale the problem up significantly, and while the data itself should still fit comfortably into memory, I’ll run into problems when converting to a DataFrame.
I’ll give you some context for what I’m trying to do, so that maybe someone can suggest an alternative approach or point out an error. I have a structured binary file in a proprietary format consisting of blocks of varying length, the length of each given by an integer at the beginning of the block. The best way I could come up with to load it is to loop through the file and read chunks into an Array of NamedTuple: I loop once without loading anything to figure out the size of the array I need, and then loop again to read the data into a preallocated array. I don’t really care about preserving the block structure. So far I’m very content with this approach, as it’s fast and efficient.
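Roughly, the reading loop looks like this (the field names, the 12-byte record layout, and the record types here are simplified placeholders, not my real format):

function read_records(path)
    # Pass 1: count records so the array can be sized up front
    n = 0
    open(path, "r") do io
        while !eof(io)
            len = read(io, Int32)      # block length prefix
            n += len
            skip(io, len * 12)         # placeholder: 12 bytes per record (Int32 + Float64)
        end
    end
    # Pass 2: read the records into a preallocated Array of NamedTuple
    A = Vector{NamedTuple{(:a, :b), Tuple{Int32, Float64}}}(undef, n)
    i = 0
    open(path, "r") do io
        while !eof(io)
            len = read(io, Int32)
            for _ in 1:len
                i += 1
                A[i] = (a = read(io, Int32), b = read(io, Float64))
            end
        end
    end
    return A
end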
However, I would like to leverage the nice functionality of DataFrames, like groupby, etc. I could maybe live with temporarily creating a second copy of the data in memory, and that’s what I expected to happen when converting from an Array of NamedTuple, but then I ran into the issue above. I also tried to preallocate the DataFrame and update its rows as I loop through the chunks of data (a sketch of that loop is further down), but it seems that in-place assignment to a DataFrameRow allocates as well (so not really in-place?):
df = DataFrame(A)
julia> @btime $df[1,:] = $A[1];
177.883 ns (4 allocations: 96 bytes)
Which, by the way, seems to exactly account for the allocations when converting the whole array:
julia> 96 * 10000000 / 1024^2
915.52734375
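For reference, this is roughly the preallocate-and-fill loop I had in mind (the column names just mirror the toy example above):

n = length(A)
# preallocate the columns, then fill the DataFrame row by row
df = DataFrame(a = Vector{Int}(undef, n), b = Vector{Float64}(undef, n))
for (i, row) in enumerate(A)
    df[i, :] = row    # this is the assignment that allocates ~96 bytes per row
end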
I tried other approaches along these lines, but somehow I always end up with a similar amount of allocated memory, much larger than what I’d expect. I have a feeling I’m doing something fundamentally wrong.
Any help will be much appreciated!