The column case in this example is a bit unusual: 100,000 columns of only three values each, and the three values have different types. I don't think it fits the design of DataFrames.jl well: normally you want all values in a column to have the same type. DataFrames.jl stores data as a list of columns (one vector per column). Here each column has to have element type Any, so every value is boxed, which is very inefficient.
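To make the boxing point concrete, here is a minimal sketch using only Base Julia. The tuple `row` and the variable names are made up for illustration; they just mimic the shape of one tuple in `data` (a String followed by two Int8 values).

```julia
# Hypothetical row with the same shape as the example data:
# a String and two Int8 values.
row = ("a", Int8(-108), Int8(-66))

# Collecting a mixed-type tuple yields a vector with element type Any,
# so each value is stored as a boxed reference.
mixed = collect(row)

# Collecting values that all share one type yields a concretely typed vector.
ints = collect((Int8(-108), Int8(-127), Int8(27)))

println(eltype(mixed))  # Any
println(eltype(ints))   # Int8
```

A column built from `mixed` cannot be stored compactly, while one built from `ints` can.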
But DataFrame(collect.(data)) works fine if all values in each column (each tuple in data) have the same type:
# This doesn't print well in the REPL: tuples are not abbreviated like arrays
data_columns = [tuple((row[i] for row in data)...) for i in 1:3];
# Just to see what is inside `data_columns`
julia> collect.(data_columns)
3-element Vector{Vector{T} where T}:
["a", "e", "c", "b", "b", "d", "b", "d", "a", "e" … "e", "b", "c", "c", "c", "c", "b", "b", "b", "c"]
Int8[-108, -127, 27, 123, -73, 77, 126, 12, -70, -101 … -78, 105, -90, -81, -114, 41, 57, -63, -19, -77]
Int8[-66, -77, 51, 5, 27, -62, 0, 11, -107, -75 … 115, 5, -85, 93, -75, -97, -92, 52, -121, -116]
It is the same data but organized as three tuples of 100,000 values. This gives good performance:
tuple2df_columns2(data, names) = DataFrame(collect.(data), names)
julia> @btime tuple2df_rows(data, Names)
457.372 μs (79 allocations: 981.38 KiB)
julia> @btime tuple2df_columns2(data_columns, Names)
411.203 μs (53 allocations: 1.91 MiB)
It can be improved further by letting the data frame store the vectors created by collect directly, instead of copying them:
tuple2df_columns2_nocopy(data, names) = DataFrame(collect.(data), names, copycols=false)
julia> @btime tuple2df_columns2_nocopy(data_columns, Names)
314.103 μs (47 allocations: 979.73 KiB)
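What `copycols=false` changes can be seen in a small sketch. The vector `v` and the column name `:x` are made up for illustration, and the sketch assumes a DataFrames.jl version that accepts the `copycols` keyword in this constructor, as in the call above.

```julia
using DataFrames

# Hypothetical column data, just for illustration.
v = Int8[1, 2, 3]

df_copy   = DataFrame([v], [:x])                  # default: the column is copied
df_nocopy = DataFrame([v], [:x], copycols=false)  # the data frame stores `v` itself

println(df_copy.x === v)    # false: the data frame holds its own copy
println(df_nocopy.x === v)  # true: same object, no copy made

# The flip side of avoiding the copy: mutating `v` is visible through df_nocopy.
v[1] = 100
println(df_nocopy.x[1])  # 100
println(df_copy.x[1])    # 1
```

That aliasing is harmless in tuple2df_columns2_nocopy, because the vectors returned by collect are temporaries that nothing else refers to.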