Memory allocations when converting from NamedTuples to DataFrame

I have been struggling with this for a while. When I try to convert an Array of NamedTuples to a DataFrame, I see excessive memory allocations. Consider the following stripped-down example:

using DataFrames
using BenchmarkTools
A = fill((a=1, b=2.0), 10000000)

julia> sizeof(A) / 1000000
160.0

julia> @btime DataFrame($A);
  1.275 s (30000024 allocations: 915.53 MiB)

Memory allocations exceed the actual size of the data several times over. With my real data, which contains ~30 columns of mixed Ints and Floats, the difference is even more dramatic: I’m getting 2 GB of allocations on a 50 MB array. This is still feasible, but soon I’ll have to scale the problem up significantly, and while the data itself should still fit comfortably into memory, I’ll run into problems when converting to a DataFrame.

I’ll give you some context on what I’m trying to do, so that maybe someone can suggest an alternative approach or point out an error. I have a structured binary file in a proprietary format consisting of blocks of varying length, the length of each given by an integer at the beginning of the block. The best way I could come up with to load it is to loop through the file and read chunks into an Array of NamedTuples. I can loop once without loading anything to figure out the size of the array I need, and then loop again to read the data into a preallocated array. I don’t really care about preserving the block structure. So far I’m very content with this approach, as it’s fast and efficient.
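For reference, here is a minimal sketch of that two-pass loop. The block layout is made up for illustration (an Int32 record count followed by (Int64, Float64) records); adjust the reads to the actual format:

function load_records(path)
    # First pass: count records so we can size the output array.
    n = open(path) do io
        total = 0
        while !eof(io)
            len = read(io, Int32)
            total += len
            skip(io, len * (sizeof(Int64) + sizeof(Float64)))
        end
        total
    end
    # Second pass: read the records into the preallocated array.
    A = Vector{NamedTuple{(:a, :b),Tuple{Int64,Float64}}}(undef, n)
    open(path) do io
        i = 0
        while !eof(io)
            len = read(io, Int32)
            for _ in 1:len
                A[i += 1] = (a = read(io, Int64), b = read(io, Float64))
            end
        end
    end
    return A
end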

However, I would like to leverage the nice functionality of DataFrames, like groupby, etc. I could maybe live with temporarily creating a second copy of the data in memory, and that’s what I expected to happen when converting from an Array of NamedTuples, but then I ran into the issue above. I also tried to preallocate the DataFrame and update its rows as I loop through the chunks of data, but it seems that in-place assignment to a DataFrameRow allocates as well (so not really in-place?):

df = DataFrame(A)

julia> @btime $df[1,:] = $A[1];
  177.883 ns (4 allocations: 96 bytes)

Which, by the way, seems to account exactly for the allocations when converting the whole array:

julia> 96 * 10000000 / 1024^2
915.52734375

I tried other approaches along these lines, but somehow I always end up with a similar amount of allocated memory, much larger than what I’d expect. I have a feeling I’m doing something fundamentally wrong.

Any help would be much appreciated!

Alternatively, you could just allocate two vectors for a and b, push! elements onto them as they appear, and then use the non-copying DataFrame! constructor, which takes columns directly.
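Something like this (a sketch; blocks stands in for your file-reading loop, and DataFrame! is the non-copying constructor in the DataFrames version current at the time, equivalent to DataFrame(...; copycols=false) in later releases):

using DataFrames

a = Int[]
b = Float64[]
for block in blocks       # placeholder for your loop over file chunks
    for rec in block
        push!(a, rec.a)
        push!(b, rec.b)
    end
end
df = DataFrame!(a = a, b = b)   # wraps the vectors without copying them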

I also think it’s easier to “append” as you go. StructArrays may help you do that. So, something like:

julia> using StructArrays

julia> iter = (fill((a=1, b=2.0), 1_000) for _ in 1:1_000);

julia> sa = foldl(iter, init=nothing) do acc, val
           isnothing(acc) ? StructArray(val) : append!(acc, val)
       end
1000000-element StructArray(::Array{Int64,1}, ::Array{Float64,1}) with eltype NamedTuple{(:a, :b),Tuple{Int64,Float64}}:
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 ⋮
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)
 (a = 1, b = 2.0)

Note that you should replace append! with append!! from BangBang.jl if the element type may change during iteration (because of missing data or things like that). append!! widens the storage and returns a new container when needed, so you use its return value instead of relying on mutation.
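A toy illustration with plain vectors (assuming BangBang.jl; the exact output display may differ by version):

julia> using BangBang

julia> append!!([1, 2], [3.0])   # Int storage can’t hold a Float64, so a widened copy is returned
3-element Array{Float64,1}:
 1.0
 2.0
 3.0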

The StructArray you get at the end has columnar storage, and you can retrieve the underlying column vectors with fieldarrays, so DataFrame!(fieldarrays(sa)) should not allocate (other than some small fixed overhead).
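For example (a sketch against the DataFrames/StructArrays versions from around the time of this thread; in later StructArrays releases fieldarrays was renamed to StructArrays.components):

julia> using DataFrames

julia> df = DataFrame!(fieldarrays(sa));

julia> df.a === fieldarrays(sa).a   # the DataFrame wraps the same column vector
true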

OP, I think the best solution, if possible, is to read your data in as a NamedTuple of Vectors. That way you can call DataFrame! without copying.

You are always going to get copies if you use a Vector of NamedTuples.

However, the strategy of preallocating vectors and updating them incrementally also seems strong. It’s easiest to do that without DataFrames first and then call DataFrame! at the end, as in the sketch below.
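A minimal sketch of that, assuming the record count n is known from a first pass over the file (blocks and the field names are placeholders again):

using DataFrames

function read_columns(blocks, n)
    cols = (a = Vector{Int}(undef, n), b = Vector{Float64}(undef, n))
    i = 0
    for block in blocks, rec in block
        i += 1
        cols.a[i] = rec.a
        cols.b[i] = rec.b
    end
    return cols
end

df = DataFrame!(read_columns(blocks, n))   # no copy: the DataFrame reuses the column vectors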

Thank you all for the great hints! In the end, as you pointed out, it boils down to storing the data in columns rather than rows and then converting to a DataFrame. StructArrays come in very handy here. This way I managed to eliminate the DataFrame-related allocations entirely.