Row-wise computation in `InMemoryDatasets.jl` vs `DataFrames.jl`

The `byrow` function in InMemoryDatasets is fantastic :+1:. I ran a small benchmark to compare its performance against DataFrames.jl on relatively wide tables (10^4 rows by 1000 columns). `argmax` in DataFrames.jl takes a long time!

InMemoryDatasets

julia> using InMemoryDatasets

julia> using BenchmarkTools

julia> x = Dataset(rand(10^4, 1000), :auto);

julia> @btime byrow(x, sum, :);
  9.107 ms (7056 allocations: 263.27 KiB)

julia> @btime byrow(x, maximum, :);
  12.277 ms (7056 allocations: 263.27 KiB)

julia> @btime byrow(x, mean, :);
  11.602 ms (14096 allocations: 554.86 KiB)

julia> @btime byrow(x, argmax, :);
  17.819 ms (14140 allocations: 766.41 KiB)

DataFrames (updated code)

julia> using DataFrames

julia> using BenchmarkTools

julia> x = DataFrame(rand(10^4, 1000), :auto);

julia> allowmissing!(x);

julia> @btime select(x, AsTable(:) => ByRow(sum));
  40.595 ms (2664 allocations: 264.89 KiB)

julia> @btime select(x, AsTable(:) => ByRow(maximum));
  40.707 ms (1665 allocations: 233.67 KiB)

julia> @btime select(x, AsTable(:) => ByRow(mean));
  45.278 ms (2665 allocations: 264.92 KiB)

julia> @btime select(x, AsTable(:) => ByRow(argmax∘collect));
  972.618 ms (21497934 allocations: 567.97 MiB)

I don’t even know how I should do row-wise `any`, `all`, `select`, `coalesce`, … in DataFrames.jl.


IMD seems to have a multithreading problem on my machine, so here are the single-threaded results for comparison.


julia> Threads.nthreads()
8

julia> using InMemoryDatasets, BenchmarkTools

julia> x = Dataset(rand(10^4, 1000), :auto);

julia> @btime byrow(x, sum, :);
  23.465 ms (1070 allocations: 125.58 KiB)

julia> @btime byrow(x, maximum, :);
  32.853 ms (1070 allocations: 125.58 KiB)

julia> @btime byrow(x, mean, :);
  29.235 ms (2124 allocations: 279.48 KiB)

julia> @btime byrow(x, argmax, :);
  48.478 ms (2165 allocations: 397.22 KiB)

and


julia> using DataFrames, BenchmarkTools

julia> x = DataFrame(rand(10^4, 1000), :auto);

julia> allowmissing!(x);

julia> @btime select(x, AsTable(:) => ByRow(sum));
  26.776 ms (2664 allocations: 264.89 KiB)

julia> @btime select(x, AsTable(:) => ByRow(maximum));
  27.197 ms (1665 allocations: 233.67 KiB)

julia> @btime select(x, AsTable(:) => ByRow(mean));
  33.093 ms (2665 allocations: 264.92 KiB)

julia> @btime select(x, AsTable(:) => ByRow(argmax));
  29.952 s (34942165 allocations: 150.43 GiB)

and


julia> using DataFrames, BenchmarkTools

julia> x = DataFrame(rand(10^4, 1000), :auto);

julia> #allowmissing!(x);

julia> @btime select(x, AsTable(:) => ByRow(sum));
  4.033 ms (1654 allocations: 223.36 KiB)

julia> @btime select(x, AsTable(:) => ByRow(maximum));
  5.436 ms (1654 allocations: 223.36 KiB)

julia> @btime select(x, AsTable(:) => ByRow(mean));
  4.051 ms (1654 allocations: 223.36 KiB)

julia> @btime select(x, AsTable(:) => ByRow(argmax));
  26.154 s (30051676 allocations: 75.85 GiB)

argmax seems to have improved a lot.

Your DataFrames.jl code for `argmax` should be changed to:

julia> @btime select(x, AsTable(:) => ByRow(argmax∘collect));
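
To make the effect of `argmax∘collect` concrete, here is a minimal sketch on a tiny table. The column names and the output name `:imax` are made up for illustration. With `AsTable(:)`, each row reaches the function as a `NamedTuple`; `collect` turns it into a plain `Vector`, so `argmax` takes the array code path and returns a 1-based column position. My understanding is that `argmax` called directly on the `NamedTuple` returns the column name instead, which would be consistent with the very slow timings above, but I have not verified that across Julia versions.

```julia
using DataFrames

# Toy data; the column names are hypothetical.
df = DataFrame(a = [1.0, 5.0], b = [3.0, 2.0], c = [2.0, 4.0])

# Each row arrives as a NamedTuple; collect converts it to a Vector first,
# so argmax returns the integer position of the largest value in the row.
res = select(df, AsTable(:) => ByRow(argmax ∘ collect) => :imax)

println(res.imax)
```

For the two rows above, the largest values sit in column 2 (`b`) and column 1 (`a`) respectively.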

This blog has some ways to deal with `any` and `all` in DataFrames.jl.
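
As a rough sketch of what row-wise `any`, `all`, and `coalesce` can look like in DataFrames.jl (not necessarily the approach the blog takes, and with no claim about performance on wide tables), the same `AsTable(:) => ByRow(...)` pattern from the benchmarks works with an anonymous function, since `any` and `all` iterate over the values of the row `NamedTuple`. For `coalesce`, the columns can be passed as separate positional arguments instead. The data and output column names below are made up.

```julia
using DataFrames

# Toy data with missing values; the column names are hypothetical.
df = DataFrame(a = [1, missing, 3],
               b = [missing, missing, 6],
               c = [7, 8, 9])

res = select(df,
    # true when at least one value in the row is missing
    AsTable(:) => ByRow(nt -> any(ismissing, nt)) => :any_missing,
    # true when no value in the row is missing
    AsTable(:) => ByRow(nt -> all(!ismissing, nt)) => :all_present,
    # first non-missing value per row; columns are passed as positional arguments
    names(df) => ByRow(coalesce) => :first_value)

println(res)
```

Passing `names(df)` rather than `AsTable(:)` to `coalesce` matters here: `coalesce` expects its candidates as separate arguments, not as a single `NamedTuple`.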