I need to extract rows from a dataframe based on values of a specific column. I tried to use filter but it’s slower than traditional findall. Am I missing something?
using DataFrames
df = DataFrame(a=["a", "b", "c"], b=[1,2,3])
@elapsed myrow = filter(row -> row[:a] == "a" , df)
@elapsed myrow = df[findall(df.a.=="a"),:]
jzr
May 14, 2021, 9:25pm
2
Here is the result with a larger df and times shown.
julia> using DataFrames; n = 100000
100000
julia> df = DataFrame(a=rand(["a", "b", "c"], n), b=rand([1,2,3], n));
julia> @elapsed myrow = df[findall(df.a.=="a"),:]
0.001101784
julia> @elapsed myrow = filter(row -> row[:a] == "a" , df)
0.207868247
julia> @elapsed myrow = filter(row -> row[:a] == "a" , df)
0.1900758
julia> @elapsed myrow = df[findall(df.a.=="a"),:]
4.57e-5
There is something strange with @elapsed. If I use BenchmarkTools I get:
@benchmark myrow = filter(row -> row[:a] == "a" , df)
median time: 3.300 μs (0.00% GC)
@benchmark myrow = df[findall(df.a.=="a"),:]
median time: 2.540 μs (0.00% GC)
So it seems filter is not slow.
jzr
May 14, 2021, 9:34pm
4
On the larger df it is slow for me
julia> @benchmark myrow = filter(row -> row[:a] == "a" , df)
BenchmarkTools.Trial:
memory estimate: 3.82 MiB
allocs estimate: 199516
--------------
minimum time: 12.073 ms (0.00% GC)
median time: 12.239 ms (0.00% GC)
mean time: 12.414 ms (0.90% GC)
maximum time: 14.290 ms (9.28% GC)
--------------
samples: 403
evals/sample: 1
julia> @benchmark myrow = df[findall(df.a.=="a"),:]
BenchmarkTools.Trial:
memory estimate: 792.34 KiB
allocs estimate: 23
--------------
minimum time: 953.499 μs (0.00% GC)
median time: 967.685 μs (0.00% GC)
mean time: 987.381 μs (1.53% GC)
maximum time: 2.213 ms (49.67% GC)
--------------
samples: 5047
evals/sample: 1
You are right, for large dataframes findall is ~10x faster:
julia> @benchmark myrow = filter(row -> row[:a] == "a" , df)
BenchmarkTools.Trial:
memory estimate: 3.82 MiB
allocs estimate: 199516
--------------
minimum time: 13.712 ms (0.00% GC)
median time: 17.261 ms (0.00% GC)
mean time: 18.321 ms (3.79% GC)
maximum time: 62.593 ms (56.82% GC)
--------------
samples: 274
evals/sample: 1
julia> @benchmark myrow = df[findall(df.a.=="a"),:]
BenchmarkTools.Trial:
memory estimate: 799.47 KiB
allocs estimate: 23
--------------
minimum time: 1.323 ms (0.00% GC)
median time: 1.952 ms (0.00% GC)
mean time: 2.101 ms (4.65% GC)
maximum time: 30.440 ms (91.51% GC)
--------------
samples: 2366
evals/sample: 1
… and @elapsed is acting strange.
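A possible explanation, offered as an assumption and not confirmed in this thread: @elapsed times a single evaluation, so the number can include JIT compilation, and an anonymous function written inline has a new type each time the line is evaluated. A minimal sketch to check this, using a hypothetical named predicate `is_a` so that only the first call should pay for compilation:

```julia
using DataFrames

n = 100_000
df = DataFrame(a=rand(["a", "b", "c"], n), b=rand([1, 2, 3], n))

# Hypothetical named predicate (not from the original posts).
# Its type stays the same across calls, unlike a fresh `row -> ...` literal.
is_a(row) = row[:a] == "a"

t_first  = @elapsed filter(is_a, df)   # may include compilation time
t_second = @elapsed filter(is_a, df)   # should be closer to pure run time

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

If the second timing is much smaller than the first, the gap seen with @elapsed is mostly compilation, which would also explain why BenchmarkTools reports lower numbers.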
The filter version is not type stable; on each iteration it needs to look up the type of row[:a] and dispatch to the right code. The type stable version is
myrow = filter(:a => (x -> x == "a") , df)
or shorter,
myrow = filter(:a => ==("a") , df)
See "Why DataFrame is not type stable and when it matters" on Bogumił Kamiński's blog for more on type stability with DataFrames.
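As a quick sanity check, here is a minimal sketch showing that the approaches discussed in this thread select the same rows. The small example data and the variable names (`by_row`, `by_col`, `by_index`, `by_mask`) are only for illustration, and the boolean-mask variant is an extra that nobody in the thread proposed:

```julia
using DataFrames

df = DataFrame(a=["a", "b", "c", "a"], b=[1, 2, 3, 4])

# Row-wise predicate: receives a DataFrameRow, so it is not type stable.
by_row = filter(row -> row[:a] == "a", df)

# Column => predicate: the predicate is applied to the values of column :a.
by_col = filter(:a => ==("a"), df)

# Broadcast comparison, then convert the mask to row indices.
by_index = df[findall(df.a .== "a"), :]

# Indexing with the boolean mask directly, skipping findall.
by_mask = df[df.a .== "a", :]

println(by_row == by_col == by_index == by_mask)
```

Which form is fastest will depend on the data and the package version, so it is worth benchmarking on your own dataframe.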