Learning to benchmark and find the best function to select a subset of a dataframe

Hello, I have several questions related to the same piece of code:

using DataFrames
using DataFramesMeta   # provides the @subset and @subset! macros used below
using BenchmarkTools

# fun1: select rows by Boolean indexing (returns a new DataFrame)
function fun1(df::DataFrame, bool_exp::BitVector)
    df_temp = df[bool_exp, :]
    return df_temp
end

# fun2: select rows with the @subset macro (returns a new DataFrame)
function fun2(df::DataFrame, bool_exp::BitVector)
    df_temp = @subset(df, collect(bool_exp))
    return df_temp
end

# fun3!: filter the DataFrame in place with the mutating @subset! macro
function fun3!(df::DataFrame, bool_exp::BitVector)
    @subset!(df, collect(bool_exp))
    return nothing
end

df = DataFrame(x=rand(100),y=rand(100),z=rand(100))

bool_x = df.x .> 0.5

@time df1=fun1(df, bool_x)
@time df2=fun2(df, bool_x)
@time fun3!(df, bool_x)

The results are:

0.000015 seconds (12 allocations: 2.188 KiB)
0.043226 seconds (36.44 k allocations: 2.151 MiB, 97.86% compilation time)
0.044530 seconds (36.42 k allocations: 2.133 MiB, 98.24% compilation time)

I was expecting the mutating function fun3! to be optimal in terms of time and memory. Why is this not true?

The other question is: how can I @benchmark a mutating function?
Doing @benchmark fun3!(df, bool_x) gives an error because the size of the data frame argument gets modified at each step of the process.

Finally, I would like to know if there are other, better options for filtering a data frame that I haven’t considered.

Thank you very much!

Use

@benchmark f(x) setup=(x=copy(x_input)) evals=1

where x_input is the initial value you want.
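To see why both pieces matter, here is a minimal sketch on a plain vector (a hypothetical toy example, not the data frame from the question): setup rebuilds the argument before every sample, and evals=1 makes each sample a single evaluation, so a mutating function never sees an already-modified input.

using BenchmarkTools

v = collect(1:1_000)

# Without setup, pop! would keep shrinking the same vector across samples;
# with setup and evals=1 every timed call gets a fresh copy of v.
@benchmark pop!(w) setup=(w = copy(v)) evals=1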


Thanks @lmiq!
Doing as you say:

@benchmark df1=fun1(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark df2=fun2(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark fun3!(x, bool_x) setup=(x=copy(df)) evals=1

I obtain the following result:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  840.000 ns … 30.324 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.282 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.631 μs ±  1.233 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 840 ns … 6.46 μs, log(frequency) by time]
 Memory estimate: 2.69 KiB, allocs estimate: 12.

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.221 μs … 113.102 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.308 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.833 μs ±   5.726 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 26.2 μs … 57.7 μs, log(frequency) by time]
 Memory estimate: 11.98 KiB, allocs estimate: 190.

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.190 μs … 130.973 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.507 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.823 μs ±   4.829 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 26.2 μs … 54.3 μs, log(frequency) by time]
 Memory estimate: 9.81 KiB, allocs estimate: 181.

Do you know why the first function is the best in terms of speed and memory usage? I was expecting the mutating function to be the best. Are there other options?

I know nothing about the @subset macros, but the collect(bool_exp) you are using in the other two cases allocates an intermediate array, which I guess is not necessary. I don’t see why you are collecting it there; as far as I know, collecting it just returns a copy of the same vector in this case.

Thanks. I use collect because it transforms the BitVector into an Array{Bool}. I do not know why @subset needs it in that form as its second argument.
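A quick way to check both points at the REPL (a sketch, with bool_x as defined earlier in the thread):

typeof(bool_x)              # BitVector
typeof(collect(bool_x))     # Vector{Bool}
collect(bool_x) === bool_x  # false: collect materializes a fresh array on every call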


The need for collect is a bug. I just filed an issue and am tracking it here


Thanks @pdeffebach.
Once this bug is solved, will using the mutating @subset! be the optimal way to filter a data frame, or is the fun1 function above still the best?

It’s just a parsing issue. So if you re-run your benchmark with identity instead of collect, you can benchmark it now without the extra allocation.
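If I understand the workaround correctly, the two macro-based functions from the top of the thread would become something like this (a sketch; identity just returns its argument, so no intermediate Vector{Bool} is allocated):

function fun2(df::DataFrame, bool_exp::BitVector)
    return @subset(df, identity(bool_exp))
end

function fun3!(df::DataFrame, bool_exp::BitVector)
    @subset!(df, identity(bool_exp))
    return nothing
end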


fun1 is the best because there is a little overhead in the DataFrames.subset function that @subset calls. This overhead matters because your data frame is so small. Try it with a million observations and see what happens.
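For instance, something along these lines (a sketch reusing the function definitions and benchmark setup from above):

df = DataFrame(x=rand(10^6), y=rand(10^6), z=rand(10^6))
bool_x = df.x .> 0.5

@benchmark fun1(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark fun2(x, bool_x) setup=(x=copy(df)) evals=1
@benchmark fun3!(x, bool_x) setup=(x=copy(df)) evals=1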


Thanks!
For a data frame with a million rows I obtain the following:

BenchmarkTools.Trial: 829 samples with 1 evaluation.
 Range (min … max):  2.934 ms … 12.817 ms  ┊ GC (min … max): 0.00% … 59.91%
 Time  (median):     3.368 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.739 ms ±  1.274 ms  ┊ GC (mean ± σ):  5.86% ± 11.59%
 [histogram omitted: 2.93 ms … 10.4 ms, log(frequency) by time]
 Memory estimate: 15.23 MiB, allocs estimate: 20.

BenchmarkTools.Trial: 726 samples with 1 evaluation.
 Range (min … max):  3.878 ms … 11.137 ms  ┊ GC (min … max): 0.00% … 57.65%
 Time  (median):     4.291 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.626 ms ±  1.206 ms  ┊ GC (mean ± σ):  4.57% ± 10.52%
 [histogram omitted: 3.88 ms … 11 ms, log(frequency) by time]
 Memory estimate: 15.36 MiB, allocs estimate: 201.

BenchmarkTools.Trial: 350 samples with 1 evaluation.
 Range (min … max):  11.574 ms …  19.007 ms  ┊ GC (min … max): 0.00% … 38.25%
 Time  (median):     11.887 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.975 ms ± 642.910 μs  ┊ GC (mean ± σ):  0.48% ±  3.33%
 [histogram omitted: 11.6 ms … 12.9 ms, frequency by time]
 Memory estimate: 4.08 MiB, allocs estimate: 196.

fun1 is faster but fun3! uses less memory, correct? What is the “allocs estimate” value? Please let me know which function you would use. Thanks!

Huh. I dunno. I wonder if something is going on with the benchmarking. Maybe @bkamins has some intuition for why fun3! is so slow.

Yes, I can answer it:

julia> x = rand(10^6);

julia> bx = [v < 0.5 for v in x];

julia> ix = findall(!, bx);

julia> @benchmark z[bx] setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 745 samples with 1 evaluation.
 Range (min … max):  4.486 ms … 28.198 ms  ┊ GC (min … max): 0.00% … 81.56%
 Time  (median):     4.569 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.939 ms ±  2.251 ms  ┊ GC (mean ± σ):  4.37% ±  8.00%
 [histogram omitted: 4.49 ms … 7.24 ms, log(frequency) by time]
 Memory estimate: 3.82 MiB, allocs estimate: 2.

julia> @benchmark z[ix] setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 1398 samples with 1 evaluation.
 Range (min … max):  1.287 ms … 29.172 ms  ┊ GC (min … max):  0.00% … 94.65%
 Time  (median):     1.355 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.699 ms ±  2.259 ms  ┊ GC (mean ± σ):  12.15% ±  8.70%
 [histogram omitted: 1.29 ms … 4.27 ms, log(frequency) by time]
 Memory estimate: 3.81 MiB, allocs estimate: 2.

julia> @benchmark deleteat!(z, bx) setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 1764 samples with 1 evaluation.
 Range (min … max):  798.600 μs …   3.617 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     825.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   927.020 μs ± 304.150 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 799 μs … 2.22 ms, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark deleteat!(z, ix) setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 859 samples with 1 evaluation.
 Range (min … max):  3.794 ms …   6.477 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.828 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.976 ms ± 316.706 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 3.79 ms … 5.21 ms, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

And the current code in DataFrames.jl assumed that deleteat! and getindex had the same relative performance, which they clearly do not. I will make a PR to fix it.

Fixed in fix deleteat! and subset! performance by bkamins · Pull Request #3249 · JuliaData/DataFrames.jl · GitHub

Hi @bkamins. As I am new to Julia, I would like to clarify a few things.
Could you explain which is more performant: deleteat! or getindex?
Finally, I would like to know whether updating DataFrames.jl will include this fix, and which function you suggest I use: fun1, fun2, or the mutating fun3! (defined at the top of the thread)?
Best

I looked at the PR and saw that you removed the use of the findall function. What are you using instead of it? What is its relation to deleteat!?
Thanks!

The answer to these questions is in my post above. The speed of operations is as follows (from fastest to slowest):

  • deleteat! with Bool index
  • getindex with integer index
  • deleteat! with integer index
  • getindex with Bool index

Note that getindex allocates a new vector, while deleteat! works in place. In the PR I remove findall, as findall changes a Bool index into an integer index. This is desired for getindex, but bad for deleteat!. The original code for deleteat! was adapted from getindex, so findall stayed there, but it should have been removed.
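To connect this back to the filtering question: with a mask where true means keep (the convention of the original bool_x, opposite to bx above), getindex uses the mask directly while the in-place route deletes the negated mask. A small sketch on a plain vector (note that .!keep allocates the negated mask, but it stays a Bool index):

x    = rand(10^6)
keep = x .> 0.5

kept = x[keep]         # getindex with a Bool index: allocates a new vector
z    = copy(x)
deleteat!(z, .!keep)   # deleteat! with a Bool index: in place, fastest in the ranking above
kept == z              # true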

Thanks, I understand the ranking now.
I am trying to understand what the ! does in this expression: findall(!, bx)

Summing up, once I update the DataFrames package, will the best option be to use my mutating fun3! function instead of the others?

findall(!, bx) means: find the indices of all values in bx which are true after the function ! is applied to them. Since ! is negation, this means we find the indices of the false values.
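A tiny example of that (hypothetical values):

bx = [true, false, true, false]
findall(!, bx)   # [2, 4] -- the indices where bx is false
findall(.!bx)    # same result, but materializes the negated mask first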

once I update the DataFrames package

Precisely - after the PR is merged and a patch release is made. Since this is not a bug but a performance issue, the release might not happen within a few days (we release bug fixes on a daily basis).

If you want an in-place function, then in the long run fun3! will be the fastest, but it might take some time until the patch is released, as I commented.


So bx is a boolean vector for the components with v < 0.5, while ix gives the indices of the elements with v > 0.5. Aren’t the operations z[bx] and z[ix] doing opposite things? The same happens with deleteat!(z, bx) and deleteat!(z, ix), I guess? Thanks for the patience.

Yes. Also, the difference is that z[ix] and deleteat!(z, ix) do different things (one keeps the given indices, the other drops them). But since the threshold is 0.5 and we have 10^6 elements, the number of elements to keep and to drop is roughly equal, so it is not a problem for the comparison.

I used ix = findall(!, bx) to make z[ix] and deleteat!(z, bx) equivalent, as most likely you want the results of the operations to be the same, and these are the performant options (but indeed I could have done it more carefully).
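A small check of that equivalence on toy data (hypothetical values):

x  = [0.1, 0.7, 0.3, 0.9]
bx = [v < 0.5 for v in x]   # true = drop
ix = findall(!, bx)         # indices of the elements to keep

z1 = x[ix]                  # keep by integer index
z2 = copy(x)
deleteat!(z2, bx)           # drop by Bool index, in place

z1 == z2 == [0.7, 0.9]      # true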