Learning to benchmark and find the best function to select a subset of a dataframe

Hello, I have several questions related to the same piece of code:

using DataFrames
using DataFramesMeta   # provides the @subset and @subset! macros used below
using BenchmarkTools

# fun1: select rows by Boolean indexing (returns a new DataFrame)
function fun1(df::DataFrame, bool_exp::BitVector)
    df_temp = df[bool_exp, :]
    return df_temp
end

# fun2: select rows with the @subset macro (returns a new DataFrame)
function fun2(df::DataFrame, bool_exp::BitVector)
    df_temp = @subset(df, collect(bool_exp))
    return df_temp
end

# fun3!: filter the DataFrame in place with the mutating @subset! macro
function fun3!(df::DataFrame, bool_exp::BitVector)
    @subset!(df, collect(bool_exp))
    return nothing
end

df = DataFrame(x=rand(100),y=rand(100),z=rand(100))

bool_x = df.x .> 0.5

@time df1=fun1(df, bool_x)
@time df2=fun2(df, bool_x)
@time fun3!(df, bool_x)

The results are:

0.000015 seconds (12 allocations: 2.188 KiB)
0.043226 seconds (36.44 k allocations: 2.151 MiB, 97.86% compilation time)
0.044530 seconds (36.42 k allocations: 2.133 MiB, 98.24% compilation time)

I was expecting the mutating function fun3! to be optimal in terms of time and memory. Why is this not true?

The other question is: how can I @benchmark a mutating function?
Doing @benchmark fun3!(df, bool_x) gives an error because the size of the data frame argument gets modified at each step of the process.

Finally, I would like to know if there are other, better options for filtering a data frame that I haven’t considered.

Thank you very much!

Use

@benchmark f(x) setup=(x=copy(x_input)) evals=1

where x_input is the initial value you want.
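To see why both pieces matter, here is a minimal sketch on a plain vector (a hypothetical toy example, not the data frame from the question): setup rebuilds the argument before every sample, and evals=1 makes each sample a single evaluation, so a mutating function never sees an already-modified input.

using BenchmarkTools

v = collect(1:1_000)

# Without setup, pop! would keep shrinking the same vector across samples;
# with setup and evals=1 every timed call gets a fresh copy of v.
@benchmark pop!(w) setup=(w = copy(v)) evals=1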


Thanks @lmiq!
Doing as you say:

@benchmark df1=fun1(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark df2=fun2(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark fun3!(x, bool_x) setup=(x=copy(df)) evals=1

I obtain the following result:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  840.000 ns … 30.324 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.282 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.631 μs ±  1.233 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 840 ns … 6.46 μs, log(frequency) by time]
 Memory estimate: 2.69 KiB, allocs estimate: 12.

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.221 μs … 113.102 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.308 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.833 μs ±   5.726 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 26.2 μs … 57.7 μs, log(frequency) by time]
 Memory estimate: 11.98 KiB, allocs estimate: 190.

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.190 μs … 130.973 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.507 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.823 μs ±   4.829 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 26.2 μs … 54.3 μs, log(frequency) by time]
 Memory estimate: 9.81 KiB, allocs estimate: 181.

Do you know why the first function is the best in terms of speed and memory usage? I was expecting the mutating function to be the best. Are there other options?

I know nothing about the @subset macros, but the collect(bool_exp) you are using in the other two cases allocates an intermediate array, which I guess is not necessary. I don’t see why you are collecting it there; as far as I know, collecting it just returns a copy of the same vector in this case.

Thanks. I use collect because it transforms the BitVector into an Array{Bool}. I do not know why @subset needs it in that form as its second argument.
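A quick way to check both points at the REPL (a sketch, with bool_x as defined earlier in the thread):

typeof(bool_x)              # BitVector
typeof(collect(bool_x))     # Vector{Bool}
collect(bool_x) === bool_x  # false: collect materializes a fresh array on every call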


The need for collect is a bug. I just filed an issue and am tracking it here


Thanks @pdeffebach.
Once this bug is solved, will using the mutating @subset! be the optimal way to filter a data frame, or is the fun1 function above still the best?

It’s just a parsing issue. So if you re-run your benchmark with identity instead of collect, you can benchmark it now without the extra allocation.
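If I understand the workaround correctly, the two macro-based functions from the top of the thread would become something like this (a sketch; identity just returns its argument, so no intermediate Vector{Bool} is allocated):

function fun2(df::DataFrame, bool_exp::BitVector)
    return @subset(df, identity(bool_exp))
end

function fun3!(df::DataFrame, bool_exp::BitVector)
    @subset!(df, identity(bool_exp))
    return nothing
end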


fun1 is the best because there is a little overhead in the DataFrames.subset function that @subset calls. This overhead matters because your data frame is so small. Try it with a million observations and see what happens.
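For instance, something along these lines (a sketch reusing the function definitions and benchmark setup from above):

df = DataFrame(x=rand(10^6), y=rand(10^6), z=rand(10^6))
bool_x = df.x .> 0.5

@benchmark fun1(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark fun2(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark fun3!(x, bool_x) setup=(x=copy(df)) evals=1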


Thanks!
For a data frame with a million rows I obtain the following:

BenchmarkTools.Trial: 829 samples with 1 evaluation.
 Range (min … max):  2.934 ms … 12.817 ms  ┊ GC (min … max): 0.00% … 59.91%
 Time  (median):     3.368 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.739 ms ±  1.274 ms  ┊ GC (mean ± σ):  5.86% ± 11.59%
 [histogram omitted: 2.93 ms … 10.4 ms, log(frequency) by time]
 Memory estimate: 15.23 MiB, allocs estimate: 20.

BenchmarkTools.Trial: 726 samples with 1 evaluation.
 Range (min … max):  3.878 ms … 11.137 ms  ┊ GC (min … max): 0.00% … 57.65%
 Time  (median):     4.291 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.626 ms ±  1.206 ms  ┊ GC (mean ± σ):  4.57% ± 10.52%
 [histogram omitted: 3.88 ms … 11 ms, log(frequency) by time]
 Memory estimate: 15.36 MiB, allocs estimate: 201.

BenchmarkTools.Trial: 350 samples with 1 evaluation.
 Range (min … max):  11.574 ms …  19.007 ms  ┊ GC (min … max): 0.00% … 38.25%
 Time  (median):     11.887 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.975 ms ± 642.910 μs  ┊ GC (mean ± σ):  0.48% ±  3.33%
 [histogram omitted: 11.6 ms … 12.9 ms, frequency by time]
 Memory estimate: 4.08 MiB, allocs estimate: 196.

fun1 is faster but fun3! uses less memory, correct? What is the “allocs estimate” value? Please let me know which function you would use. Thanks!

Huh. I dunno. I wonder if something is going on with the benchmarking. Maybe @bkamins has some intuition for why fun3! is so slow.

Yes, I can answer it:

julia> x = rand(10^6);

julia> bx = [v < 0.5 for v in x];

julia> ix = findall(!, bx);

julia> @benchmark z[bx] setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 745 samples with 1 evaluation.
 Range (min … max):  4.486 ms … 28.198 ms  ┊ GC (min … max): 0.00% … 81.56%
 Time  (median):     4.569 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.939 ms ±  2.251 ms  ┊ GC (mean ± σ):  4.37% ±  8.00%
 [histogram omitted: 4.49 ms … 7.24 ms, log(frequency) by time]
 Memory estimate: 3.82 MiB, allocs estimate: 2.

julia> @benchmark z[ix] setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 1398 samples with 1 evaluation.
 Range (min … max):  1.287 ms … 29.172 ms  ┊ GC (min … max):  0.00% … 94.65%
 Time  (median):     1.355 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.699 ms ±  2.259 ms  ┊ GC (mean ± σ):  12.15% ±  8.70%
 [histogram omitted: 1.29 ms … 4.27 ms, log(frequency) by time]
 Memory estimate: 3.81 MiB, allocs estimate: 2.

julia> @benchmark deleteat!(z, bx) setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 1764 samples with 1 evaluation.
 Range (min … max):  798.600 μs …   3.617 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     825.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   927.020 μs ± 304.150 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 799 μs … 2.22 ms, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark deleteat!(z, ix) setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 859 samples with 1 evaluation.
 Range (min … max):  3.794 ms …   6.477 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.828 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.976 ms ± 316.706 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 3.79 ms … 5.21 ms, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

And the current code in DataFrames.jl assumed that deleteat! and getindex had the same relative performance, which they clearly do not. I will make a PR to fix it.

Fixed in fix deleteat! and subset! performance by bkamins · Pull Request #3249 · JuliaData/DataFrames.jl · GitHub

Hi @bkamins. As I am new to Julia, I would like to clarify a few things.
Could you explain which is more performant: deleteat! or getindex?
Finally, I would like to know whether updating DataFrames.jl will include this fix, and which function you suggest I use: fun1, fun2, or the mutating fun3! (defined at the top of the thread)?
Best

I looked at the PR and saw that you removed the use of the findall function. What are you using instead of it? What is its relation to deleteat!?
Thanks!

The answer to these questions is in my post above. The speed of operations is as follows (from fastest to slowest):

  • deleteat! with Bool index
  • getindex with integer index
  • deleteat! with integer index
  • getindex with Bool index

Note that getindex allocates a new vector, while deleteat! works in place. In the PR I remove findall, as findall changes a Bool index into an integer index. This is desired for getindex, but bad for deleteat!. The original code for deleteat! was adapted from getindex, so findall stayed there, but it should have been removed.
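To connect this back to the filtering question: with a mask where true means keep (the convention of the original bool_x, opposite to bx above), getindex uses the mask directly while the in-place route deletes the negated mask. A small sketch on a plain vector (note that .!keep allocates the negated mask, but it stays a Bool index):

x    = rand(10^6)
keep = x .> 0.5

kept = x[keep]         # getindex with a Bool index: allocates a new vector
z    = copy(x)
deleteat!(z, .!keep)   # deleteat! with a Bool index: in place, fastest in the ranking above
kept == z              # true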

Thanks, I understand the ranking now.
I am trying to understand what the ! does in this expression: findall(!, bx)

Summing up, once I update the DataFrames package, will the best option be to use my mutating fun3! function instead of the others?

findall(!, bx) means: find the indices of all values in bx which are true after the function ! is applied to them. Since ! is negation, this means we find the indices of the false values.
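A tiny example of that (hypothetical values):

bx = [true, false, true, false]
findall(!, bx)   # [2, 4] -- the indices where bx is false
findall(.!bx)    # same result, but materializes the negated mask first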

once I update the DataFrames package

Precisely - after the PR is merged and a patch release is made. Since this is not a bug but a performance issue, the release might not happen within a few days (we release bug fixes on a daily basis).

If you want an in-place function, then in the long run fun3! will be the fastest, but it might take some time until the patch is released, as I commented.


So bx is a boolean vector for the components with v < 0.5, while ix gives the indices of the elements with v > 0.5. Aren’t the operations z[bx] and z[ix] doing opposite things? The same happens with deleteat!(z, bx) and deleteat!(z, ix), I guess? Thanks for the patience.

Yes. Also, the difference is that z[ix] and deleteat!(z, ix) do different things (one keeps the given indices, the other drops them). But since the threshold is 0.5 and we have 10^6 elements, the number of elements to keep and to drop is roughly equal, so it is not a problem for the comparison.

I used ix = findall(!, bx) to make z[ix] and deleteat!(z, bx) equivalent, as most likely you want the results of the operations to be the same, and these are the performant options (but indeed I could have done it more carefully).
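A small check of that equivalence on toy data (hypothetical values):

x  = [0.1, 0.7, 0.3, 0.9]
bx = [v < 0.5 for v in x]   # true = drop
ix = findall(!, bx)         # indices of the elements to keep

z1 = x[ix]                  # keep by integer index
z2 = copy(x)
deleteat!(z2, bx)           # drop by Bool index, in place

z1 == z2 == [0.7, 0.9]      # true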