The byrow function in InMemoryDatasets is fantastic. I ran a small benchmark comparing its performance against DataFrames.jl on a relatively wide table (10^4 rows × 1000 columns). Row-wise argmax in DataFrames.jl is especially slow: about 973 ms versus about 18 ms with byrow.
InMemoryDatasets
julia> using InMemoryDatasets
julia> using BenchmarkTools
julia> x = Dataset(rand(10^4, 1000), :auto);
julia> @btime byrow(x, sum, :);
9.107 ms (7056 allocations: 263.27 KiB)
julia> @btime byrow(x, maximum, :);
12.277 ms (7056 allocations: 263.27 KiB)
julia> @btime byrow(x, mean, :);
11.602 ms (14096 allocations: 554.86 KiB)
julia> @btime byrow(x, argmax, :);
17.819 ms (14140 allocations: 766.41 KiB)
DataFrames (updated code)
julia> using DataFrames
julia> using BenchmarkTools
julia> using Statistics  # for mean
julia> x = DataFrame(rand(10^4, 1000), :auto);
julia> allowmissing!(x);
julia> @btime select(x, AsTable(:) => ByRow(sum));
40.595 ms (2664 allocations: 264.89 KiB)
julia> @btime select(x, AsTable(:) => ByRow(maximum));
40.707 ms (1665 allocations: 233.67 KiB)
julia> @btime select(x, AsTable(:) => ByRow(mean));
45.278 ms (2665 allocations: 264.92 KiB)
julia> @btime select(x, AsTable(:) => ByRow(argmax∘collect));
972.618 ms (21497934 allocations: 567.97 MiB)
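My guess is that most of that time goes into collecting a 1000-field NamedTuple for every row. A minimal sketch of a possible workaround, assuming all columns share one element type so the table converts cleanly to a matrix (the small table and variable names below are only for illustration, and I have not benchmarked this):

```julia
using DataFrames

# Small stand-in for the wide table used in the benchmark above.
df = DataFrame(rand(5, 4), :auto)

# Convert once to a matrix, then take argmax over each row view.
m = Matrix(df)
idx_matrix = [argmax(r) for r in eachrow(m)]

# Reference result using the AsTable/ByRow form from the benchmark.
idx_byrow = select(df, AsTable(:) => ByRow(argmax∘collect) => :idx).idx
```

Both give the position of the largest value in each row as an integer index, so the two vectors should agree.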
I don’t even know how I would do row-wise any, all, select, coalesce, … in DataFrames.jl.
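My best guess for a few of these is something like the sketch below. It assumes that AsTable(:) hands each row to the function as a NamedTuple and that passing a list of column names to ByRow splats the row values as positional arguments. The table is made-up data, and I would not expect this to be fast on wide tables:

```julia
using DataFrames

# Hypothetical small table with missing values.
df = DataFrame(a = [1, missing, missing],
               b = [2, 2, missing],
               c = [3, 4, missing])

res = select(df,
    # true if any value in the row is missing
    AsTable(:) => ByRow(r -> any(ismissing, r)) => :any_missing,
    # true if every value in the row is missing
    AsTable(:) => ByRow(r -> all(ismissing, r)) => :all_missing,
    # first non-missing value in the row, or missing if there is none
    names(df) => ByRow(coalesce) => :first_nonmissing)
```

This is nowhere near as direct as byrow(x, any, :) or byrow(x, coalesce, :).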