I was expecting the mutating function fun3! to be the optimal one in terms of time and memory. Why is this not the case?
My other question is: how can I @benchmark a mutating function?
Doing @benchmark fun3!(df, bool_x) gives an error because the size of the dataframe argument gets modified at each step of the process (a workaround using BenchmarkTools' setup keyword is sketched below).
Finally, I would like to know if there are other, better options for filtering a dataframe that I haven't considered.
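A minimal sketch of how the mutating function could be benchmarked, assuming df and bool_x are the objects defined at the top of the thread: pass a fresh copy of the data frame to every evaluation via the setup keyword and force one evaluation per sample (the same pattern is used later in this thread):

julia> using BenchmarkTools

julia> @benchmark fun3!(d, $bool_x) setup=(d = copy($df)) evals=1   # d is rebuilt before each evaluation, so the mutation never accumulates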
Do you know why the first function is the best in terms of speed and memory usage? I was expecting the mutating function to be the best. Are there other options?
I know nothing about the @subset macros, but the collect(bool_vector) you are using in both other cases is by itself allocating an intermediate array, which I guess is not necessary. I don't see why you are collecting it there, as afaik collecting it just returns a copy of the same vector in this case.
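For reference, a quick REPL check (illustrative, with a throwaway vector) confirms that collect on a Vector{Bool} just returns an equal but newly allocated copy:

julia> bv = [v < 0.5 for v in rand(10)];   # a small Vector{Bool}

julia> cv = collect(bv);                   # allocates a new vector

julia> cv == bv                            # same values...
true

julia> cv === bv                           # ...but a different object, i.e. a copy
false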
Thanks @pdeffebach.
Once this bug is solved, will using the mutating @subset! be the optimal way to filter a dataframe, or is the above fun1 function still the best?
fun1 is the best because there is a little overhead in the DataFrames.subset function that @subset calls. This overhead matters because your data frame is so small. Try it with a million observations and see what happens.
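A sketch of how such a million-row test could be set up (the column name x and the mask construction are guesses modelled on the example at the top of the thread):

julia> using DataFrames

julia> df = DataFrame(x = rand(10^6));      # one million rows

julia> bool_x = [v < 0.5 for v in df.x];    # Bool mask, roughly half true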
Thanks!
For a dataframe with a million rows I obtain the following:
BenchmarkTools.Trial: 829 samples with 1 evaluation.
 Range (min … max):  2.934 ms … 12.817 ms  ┊ GC (min … max): 0.00% … 59.91%
 Time  (median):     3.368 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.739 ms ±  1.274 ms  ┊ GC (mean ± σ):  5.86% ± 11.59%

  2.93 ms        Histogram: log(frequency) by time        10.4 ms <

 Memory estimate: 15.23 MiB, allocs estimate: 20.
BenchmarkTools.Trial: 726 samples with 1 evaluation.
 Range (min … max):  3.878 ms … 11.137 ms  ┊ GC (min … max): 0.00% … 57.65%
 Time  (median):     4.291 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.626 ms ±  1.206 ms  ┊ GC (mean ± σ):  4.57% ± 10.52%

  3.88 ms        Histogram: log(frequency) by time        11 ms <

 Memory estimate: 15.36 MiB, allocs estimate: 201.
BenchmarkTools.Trial: 350 samples with 1 evaluation.
 Range (min … max):  11.574 ms … 19.007 ms   ┊ GC (min … max): 0.00% … 38.25%
 Time  (median):     11.887 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.975 ms ± 642.910 μs  ┊ GC (mean ± σ):  0.48% ±  3.33%

  11.6 ms        Histogram: frequency by time        12.9 ms <

 Memory estimate: 4.08 MiB, allocs estimate: 196.
fun1 is faster but fun3! uses less memory. Correct? What does "allocs estimate" mean? Please let me know which function you would use. Thanks!
julia> x = rand(10^6);
julia> bx = [v < 0.5 for v in x];
julia> ix = findall(!, bx);
julia> @benchmark z[bx] setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 745 samples with 1 evaluation.
 Range (min … max):  4.486 ms … 28.198 ms  ┊ GC (min … max): 0.00% … 81.56%
 Time  (median):     4.569 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.939 ms ±  2.251 ms  ┊ GC (mean ± σ):  4.37% ±  8.00%

  4.49 ms        Histogram: log(frequency) by time        7.24 ms <

 Memory estimate: 3.82 MiB, allocs estimate: 2.
julia> @benchmark z[ix] setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 1398 samples with 1 evaluation.
 Range (min … max):  1.287 ms … 29.172 ms  ┊ GC (min … max):  0.00% … 94.65%
 Time  (median):     1.355 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.699 ms ±  2.259 ms  ┊ GC (mean ± σ):  12.15% ±  8.70%

  1.29 ms        Histogram: log(frequency) by time        4.27 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.
julia> @benchmark deleteat!(z, bx) setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 1764 samples with 1 evaluation.
 Range (min … max):  798.600 μs …   3.617 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     825.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   927.020 μs ± 304.150 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  799 μs        Histogram: log(frequency) by time        2.22 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark deleteat!(z, ix) setup=(z=copy(x)) evals=1
BenchmarkTools.Trial: 859 samples with 1 evaluation.
 Range (min … max):  3.794 ms …   6.477 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.828 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.976 ms ± 316.706 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.79 ms        Histogram: log(frequency) by time        5.21 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
The current code in DataFrames.jl assumed that deleteat! and getindex had the same performance relationship, which clearly they do not. I will make a PR to fix it.
Hi @bkamins. As I am new to Julia, I would like to clarify a few things.
Could you explain to me which is more performant: deleteat! or getindex?
Finally, I would like to know whether updating DataFrames.jl will include this fix, and which function you suggest I use: fun1, fun2, or the mutating fun3! (defined at the top of the thread)?
Best
I looked at the PR and saw that you removed the use of the findall function. What are you using instead of it? What is its relation to deleteat!?
Thanks!
The answer to these questions is in my post above. The speed of operations is as follows (from fastest to slowest):
deleteat! with Bool index
getindex with integer index
deleteat! with integer index
getindex with Bool index
Note that getindex allocates a new vector, while deleteat! works in place. In the PR I remove findall, as findall changes a Bool index into an integer index. This is desirable for getindex, but bad for deleteat!. The original code for deleteat! was adapted from getindex, so findall stayed there, but it should have been removed.
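Conceptually, the change is something like the following (an illustrative sketch, not the actual DataFrames.jl source; col stands for a column vector and mask for a Bool row mask, both names are placeholders):

# before: the Bool mask was converted to integer indices before deleting rows
deleteat!(col, findall(mask))   # integer-index path - slower for deleteat!

# after: the Bool mask is passed straight through
deleteat!(col, mask)            # Bool-index path - the fastest variant in the benchmarks above

Both calls delete the entries at which mask is true; only the indexing style differs.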
findall(!, bx) means: find the indices of all values in bx which are true after the function ! is applied to them. Since ! is negation, this means we find the indices of the false values.
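A tiny example:

julia> bx = [true, false, false, true];

julia> findall(!, bx)    # indices at which bx is false
2-element Vector{Int64}:
 2
 3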
once I update the DataFrames package
Precisely - after the PR is merged and a patch release is made. Since this is not a bug but a performance issue, the release might not happen within a few days (we release bug fixes on a daily basis).
If you want an in-place function then, in the long run, fun3! will be fastest, but it might take some time until the patch is released, as I commented above.
So bx is a Boolean vector for the components with v < 0.5, while ix gives the indices of the elements with v > 0.5. Aren't the operations z[bx] and z[ix] doing opposite things? The same happens with deleteat!(z, bx) and deleteat!(z, ix), I guess? Thanks for your patience.
Yes. Note also that z[ix] and deleteat!(z, ix) do different things (one keeps the indexed elements, the other drops them). But since the threshold is 0.5 and we have 10^6 elements, the numbers of elements to keep and to drop are roughly equal, so it is not a problem.
I used ix = findall(!, bx) to make z[ix] and deleteat!(z, bx) equivalent, as most likely you want the results of the operations to be the same, and those are the performant options (but indeed I could have done it more carefully).
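A quick REPL check (reusing x, bx and ix from the benchmarks above) that the two performant forms indeed produce the same result:

julia> z = copy(x);

julia> deleteat!(z, bx) == x[ix]   # both keep exactly the elements that are not below 0.5
true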