Performance of DataFrames' subset and ByRow

nilshg · May 5, 2021, 8:30am

What’s the difference between ByRow and an anonyomous function like

julia> subset(df, :x => (x -> x .== 0))
3×1 DataFrame
 Row │ x       
     │ Float64 
─────┼─────────
   1 │     0.0
   2 │     0.0
   3 │     0.0

bkamins · May 5, 2021, 8:33am

Compilation latency (ByRow does not use broadcasting which is expensive to compile). In both cases I show a fresh session:

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> @time subset(df, :x => ByRow(==(0)));
  0.929344 seconds (1.01 M allocations: 59.326 MiB, 2.97% gc time, 99.77% compilation time)

julia> @time subset(df, :x => ByRow(==(0)));
  0.000162 seconds (150 allocations: 8.688 KiB)

julia> @time subset(df, :x => ByRow(==(0)));
  0.000156 seconds (150 allocations: 8.688 KiB)

vs

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> @time subset(df, :x => (x -> x .== 0));
  1.026953 seconds (1.48 M allocations: 83.052 MiB, 4.37% gc time, 100.16% compilation time)

julia> @time subset(df, :x => (x -> x .== 0));
  0.160144 seconds (318.39 k allocations: 17.198 MiB, 99.26% compilation time)

julia> @time subset(df, :x => (x -> x .== 0));
  0.173431 seconds (318.39 k allocations: 17.201 MiB, 4.82% gc time, 99.41% compilation time)

nilshg · May 5, 2021, 8:34am

So presumably this isn’t an issue if nrow(df) is 1,000,000 or something and the function a bit more expensive, but it is an issue if I were to do it in a loop with different conditions?

bkamins · May 5, 2021, 8:41am

if I were to do it in a loop with different conditions?

In this case better define a function beforehand and make sure it is compiled once:

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> eq(x) = x .== 0;

julia> @time subset(df, :x => eq);
  0.688346 seconds (1.48 M allocations: 82.848 MiB, 3.88% gc time, 99.77% compilation time)

julia> @time subset(df, :x => eq);
  0.000189 seconds (148 allocations: 8.531 KiB)

julia> @time subset(df, :x => eq);
  0.000135 seconds (148 allocations: 8.531 KiB)

The ByRow trick is that it is type stable and ==(0) is compiled only once (which is a common case):

julia> using DataFrames

julia> df = DataFrame(x=zeros(3));

julia> @time subset(df, :x => ByRow(==(0)));
  0.629286 seconds (1.01 M allocations: 59.326 MiB, 3.67% gc time, 99.75% compilation time)

julia> @time subset(df, :x => ByRow(==(0)));
  0.000142 seconds (150 allocations: 8.688 KiB)

julia> @time subset(df, :x => ByRow(==(1)));
  0.000125 seconds (150 allocations: 8.656 KiB)

julia> @time subset(df, :x => ByRow(==(2)));
  0.000114 seconds (150 allocations: 8.656 KiB)

julia> @time subset(df, :x => ByRow(==(3)));
  0.000109 seconds (150 allocations: 8.656 KiB)

(timings are different, as I have just finished a MS Teams meeting - but I hope you can see the pattern; BTW - videocalls are the biggest enemy of proper benchmarking )

nilshg · May 5, 2021, 9:47am

Thanks, that’s helpful. Answering another question on here, I just stumbled over an odd performance difference just now. Could you explain the following:

julia> df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"], 
age = [20, 14, 65, 34, 23, 67])        

6×2 DataFrame
 Row │ sex     age   
     │ String  Int64 
─────┼───────────────
   1 │ male       20
   2 │ male       14
   3 │ female     65
   4 │ male       34
   5 │ female     23
   6 │ male       67

julia> @btime @view $df[$df.sex .== "male", :];
  668.153 ns (9 allocations: 448 bytes)

julia> @btime $df[$df.sex .== "male", :];
  1.320 μs (20 allocations: 1.48 KiB)

julia> @btime subset($df, :sex => ByRow(==("male")))
  21.900 μs (152 allocations: 8.91 KiB)

julia> eq_male(x) = x .== "male";

julia> @btime subset($df, :sex => eq_male)
  22.300 μs (147 allocations: 8.69 KiB)

I get the difference between using views and not using them, but why is subset here 20 times slower than simple indexing with a boolean mask?

bkamins · May 5, 2021, 10:02am

Because subset has much more complex internal logic than just indexing. If you increate the size of the data frame the performance will be similar (still there will be some overhead):

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(sex = ["male", "male", "female", "male", "female", "male"],
                      age = [20, 14, 65, 34, 23, 67]);

julia> repeat!(df, 10^7);

julia> nrow(df)
60000000

julia> @btime $df[$df.sex .== "male", :];
  672.016 ms (29 allocations: 922.69 MiB)

julia> @btime subset($df, :sex => ByRow(==("male")));
  708.548 ms (162 allocations: 979.91 MiB)

nilshg · May 5, 2021, 10:54am

Thanks for clarifying again. Not to let this discussion get out of hand too much (moderators feel free to split this into a “benchmarking DataFrames indexing” thread), but what does the more complex logic in subset buy me? When would I want to use it over just creating a boolean mask and indexing?

bkamins · May 5, 2021, 11:43am

I am not a moderator so I am not able to split (and it should be split so if moderators could do this it is welcome), but what subset buys you is mostly consistency between AbstractDataFrame and GroupedDataFrame API. The reason for complexity is that we call select under the hood. We could probably avoid this call to improve speed, but using it makes sure that we are consistent with the whole ecosystem (we do not have to maintain a separate processing logic for subset).

pdeffebach · May 5, 2021, 1:39pm

Are we facing a compilation-speed tradeoff here? Are the helper methods used in select not being cached the way they could be because we have to use Ref{Any} and @nospecialize?

bkamins · May 5, 2021, 2:18pm

This is not a compilation but rather specialization tradeoff. Essentially both in subset and in select we avoid specialization to reduce compilation latency. This overhead is small, so it you process 10^9 row data this is negligible, but if you want to process 10^6 tables having 10 rows this will be visible.

The assumption is that if someone has 10^6 tables having 10 rows, then most likely it is better to convert these tables to Tables.columntable and work in a fully type stable mode. DataFrames.jl is inherently type-unstable, so its functions are not super fast in hot-loops (but internally we handle hot loops efficiently, which means that very long or wide tables are handled well).

Is this distinction clear?

pdeffebach · May 5, 2021, 2:31pm

Yes, thank you.

It should also be emphasized that this cost is paid once per select call. So if you have a DataFrame with many groups, you don’t have to pay this penalty over and over again.

Topic		Replies	Views
ByRow subset vs filter performance Data dataframes	6	591	August 19, 2022
ByRow vs broadcasting performance Performance question	0	265	August 20, 2022
Rowwise compuation in `InMemoryDatasets.jl` vs `DataFrames.jl` Performance data , dataframes , inmemorydatasets	2	680	March 23, 2022
Learning to benchmark and find the best function to select a subset of a dataframe New to Julia question	20	438	December 16, 2022
Converting filter using "\|\|" condition to subset General Usage dataframes	4	364	May 16, 2021

Performance of DataFrames' subset and ByRow

Related topics