The column case in this example is a bit unusual: 100,000 columns of only three values each, and those three values have different types. I don't think it fits the design of DataFrames.jl well: normally you want all values in a column to have the same type. DataFrames.jl stores data as a list of columns (one vector per column). Here each column has to have element type Any, so every value is boxed, which is very inefficient.
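To make the element-type point concrete, here is a minimal sketch using only Base Julia. The sample values and the names mixed, uniform, mixed_col and uniform_col are made up for illustration and are not the benchmark data:

```julia
# Hypothetical sample values, not taken from the benchmark data.
mixed = ("a", Int8(-108), Int8(-66))           # one "column" holding a String and two Int8
uniform = (Int8(-108), Int8(-127), Int8(27))   # a column whose values all share one type

mixed_col = collect(mixed)
uniform_col = collect(uniform)

println(typeof(mixed_col))    # Vector{Any}
println(typeof(uniform_col))  # Vector{Int8}

# A concrete element type lets Julia store the values inline in the array;
# with Any, each element is a reference to a separately allocated (boxed) value.
println(isconcretetype(eltype(mixed_col)))    # false
println(isconcretetype(eltype(uniform_col)))  # true
```

The Vector{Any} result is what each of the 100,000 three-value columns ends up as when the tuples are treated as columns.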
But DataFrame(collect.(data)) works fine if each column (each tuple in data) has all values of the same type:
# This doesn't print well in the REPL: tuples are not abbreviated like arrays
data_columns = [tuple((row[i] for row in data)...) for i in 1:3];
# Just to see what is inside `data_columns`
julia> collect.(data_columns)
3-element Vector{Vector{T} where T}:
["a", "e", "c", "b", "b", "d", "b", "d", "a", "e" … "e", "b", "c", "c", "c", "c", "b", "b", "b", "c"]
Int8[-108, -127, 27, 123, -73, 77, 126, 12, -70, -101 … -78, 105, -90, -81, -114, 41, 57, -63, -19, -77]
Int8[-66, -77, 51, 5, 27, -62, 0, 11, -107, -75 … 115, 5, -85, 93, -75, -97, -92, 52, -121, -116]
It is the same data but organized as three tuples of 100,000 values. This gives good performance:
tuple2df_columns2(data, names) = DataFrame(collect.(data), names)
julia> @btime tuple2df_rows(data, Names)
457.372 μs (79 allocations: 981.38 KiB)
julia> @btime tuple2df_columns2(data_columns, Names)
411.203 μs (53 allocations: 1.91 MiB)
And it can be further improved by letting the data frame directly store the vectors created by collect instead of copying them:
tuple2df_columns2_nocopy(data, names) = DataFrame(collect.(data), names, copycols=false)
julia> @btime tuple2df_columns2_nocopy(data_columns, Names)
314.103 μs (47 allocations: 979.73 KiB)
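To see what copycols=false changes, here is a small sketch, assuming DataFrames.jl is installed. The vectors labels and nums and the column names :label and :value are hypothetical stand-ins for the result of collect.(data_columns):

```julia
using DataFrames

# Hypothetical small columns standing in for collect.(data_columns).
labels = ["a", "e", "c"]
nums = Int8[-108, -127, 27]

# Default behavior (copycols=true): the data frame copies each vector.
df_copy = DataFrame([labels, nums], [:label, :value])
# With copycols=false the data frame stores the vectors it was given.
df_nocopy = DataFrame([labels, nums], [:label, :value], copycols=false)

println(df_copy.value === nums)    # false: the data frame holds its own copy
println(df_nocopy.value === nums)  # true: same vector object

# Because the storage is shared, mutating the vector also changes df_nocopy.
nums[1] = 0
println(df_nocopy.value[1])  # 0
println(df_copy.value[1])    # -108
```

Sharing storage is safe in tuple2df_columns2_nocopy because collect creates fresh vectors that nothing else holds a reference to. If you pass in vectors that you keep using elsewhere, the default copy is the safer choice.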