Performance of DataFrames' subset and ByRow

What’s the difference between ByRow and an anonyomous function like

julia> subset(df, :x => (x -> x .== 0))
3×1 DataFrame
 Row │ x       
     │ Float64 
─────┼─────────
   1 │     0.0
   2 │     0.0
   3 │     0.0

Compilation latency (ByRow does not use broadcasting which is expensive to compile). In both cases I show a fresh session:

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> @time subset(df, :x => ByRow(==(0)));
  0.929344 seconds (1.01 M allocations: 59.326 MiB, 2.97% gc time, 99.77% compilation time)

julia> @time subset(df, :x => ByRow(==(0)));
  0.000162 seconds (150 allocations: 8.688 KiB)

julia> @time subset(df, :x => ByRow(==(0)));
  0.000156 seconds (150 allocations: 8.688 KiB)

vs

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> @time subset(df, :x => (x -> x .== 0));
  1.026953 seconds (1.48 M allocations: 83.052 MiB, 4.37% gc time, 100.16% compilation time)

julia> @time subset(df, :x => (x -> x .== 0));
  0.160144 seconds (318.39 k allocations: 17.198 MiB, 99.26% compilation time)

julia> @time subset(df, :x => (x -> x .== 0));
  0.173431 seconds (318.39 k allocations: 17.201 MiB, 4.82% gc time, 99.41% compilation time)
5 Likes

So presumably this isn’t an issue if nrow(df) is 1,000,000 or something and the function a bit more expensive, but it is an issue if I were to do it in a loop with different conditions?

if I were to do it in a loop with different conditions?

In this case better define a function beforehand and make sure it is compiled once:

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> eq(x) = x .== 0;

julia> @time subset(df, :x => eq);
  0.688346 seconds (1.48 M allocations: 82.848 MiB, 3.88% gc time, 99.77% compilation time)

julia> @time subset(df, :x => eq);
  0.000189 seconds (148 allocations: 8.531 KiB)

julia> @time subset(df, :x => eq);
  0.000135 seconds (148 allocations: 8.531 KiB)

The ByRow trick is that it is type stable and ==(0) is compiled only once (which is a common case):

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> @time subset(df, :x => ByRow(==(0)));
  0.629286 seconds (1.01 M allocations: 59.326 MiB, 3.67% gc time, 99.75% compilation time)

julia> @time subset(df, :x => ByRow(==(0)));
  0.000142 seconds (150 allocations: 8.688 KiB)

julia> @time subset(df, :x => ByRow(==(1)));
  0.000125 seconds (150 allocations: 8.656 KiB)

julia> @time subset(df, :x => ByRow(==(2)));
  0.000114 seconds (150 allocations: 8.656 KiB)

julia> @time subset(df, :x => ByRow(==(3)));
  0.000109 seconds (150 allocations: 8.656 KiB)

(timings are different, as I have just finished a MS Teams meeting - but I hope you can see the pattern; BTW - videocalls are the biggest enemy of proper benchmarking :smiley:)

3 Likes

Thanks, that’s helpful. Answering another question on here, I just stumbled over an odd performance difference just now. Could you explain the following:

julia> df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"], 
age = [20, 14, 65, 34, 23, 67])        

6×2 DataFrame
 Row │ sex     age   
     │ String  Int64 
─────┼───────────────
   1 │ male       20
   2 │ male       14
   3 │ female     65
   4 │ male       34
   5 │ female     23
   6 │ male       67

julia> @btime @view $df[$df.sex .== "male", :];
  668.153 ns (9 allocations: 448 bytes)

julia> @btime $df[$df.sex .== "male", :];
  1.320 μs (20 allocations: 1.48 KiB)

julia> @btime subset($df, :sex => ByRow(==("male")))
  21.900 μs (152 allocations: 8.91 KiB)

julia> eq_male(x) = x .== "male";

julia> @btime subset($df, :sex => eq_male)
  22.300 μs (147 allocations: 8.69 KiB)

I get the difference between using views and not using them, but why is subset here 20 times slower than simple indexing with a boolean mask?

Because subset has much more complex internal logic than just indexing. If you increate the size of the data frame the performance will be similar (still there will be some overhead):

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"],
                      age = [20, 14, 65, 34, 23, 67]);

julia> repeat!(df, 10^7);

julia> nrow(df)
60000000

julia> @btime $df[$df.sex .== "male", :];
  672.016 ms (29 allocations: 922.69 MiB)

julia> @btime subset($df, :sex => ByRow(==("male")));
  708.548 ms (162 allocations: 979.91 MiB)
1 Like

Thanks for clarifying again. Not to let this discussion get out of hand too much (moderators feel free to split this into a “benchmarking DataFrames indexing” thread), but what does the more complex logic in subset buy me? When would I want to use it over just creating a boolean mask and indexing?

I am not a moderator so I am not able to split (and it should be split so if moderators could do this it is welcome), but what subset buys you is mostly consistency between AbstractDataFrame and GroupedDataFrame API. The reason for complexity is that we call select under the hood. We could probably avoid this call to improve speed, but using it makes sure that we are consistent with the whole ecosystem (we do not have to maintain a separate processing logic for subset).

1 Like

Are we facing a compilation-speed tradeoff here? Are the helper methods used in select not being cached the way they could be because we have to use Ref{Any} and @nospecialize?

This is not a compilation but rather specialization tradeoff. Essentially both in subset and in select we avoid specialization to reduce compilation latency. This overhead is small, so it you process 10^9 row data this is negligible, but if you want to process 10^6 tables having 10 rows this will be visible.

The assumption is that if someone has 10^6 tables having 10 rows, then most likely it is better to convert these tables to Tables.columntable and work in a fully type stable mode. DataFrames.jl is inherently type-unstable, so its functions are not super fast in hot-loops (but internally we handle hot loops efficiently, which means that very long or wide tables are handled well).

Is this distinction clear?

2 Likes

Yes, thank you.

It should also be emphasized that this cost is paid once per select call. So if you have a DataFrame with many groups, you don’t have to pay this penalty over and over again.