Why is this dataframe filter slow?

I need to extract rows from a dataframe based on values of a specific column. I tried to use filter but it’s slower than traditional findall. Am I missing something?

using DataFrames

df = DataFrame(a=["a", "b", "c"], b=[1,2,3])

@elapsed myrow = filter(row -> row[:a] == "a" , df)

@elapsed myrow = df[findall(df.a.=="a"),:]

Here is the result with a larger df and times shown.

julia> using DataFrames; n = 100000
100000

julia> df = DataFrame(a=rand(["a", "b", "c"], n), b=rand([1,2,3], n));

julia> @elapsed myrow = df[findall(df.a.=="a"),:]
0.001101784

julia> @elapsed myrow = filter(row -> row[:a] == "a" , df)
0.207868247
julia> @elapsed myrow = filter(row -> row[:a] == "a" , df)
0.1900758

julia> @elapsed myrow = df[findall(df.a.=="a"),:]
4.57e-5

There is something strange with @elapsed. If I use BenchmarkTools I get:

@benchmark myrow = filter(row -> row[:a] == "a" , df)
median time:      3.300 μs (0.00% GC)

@benchmark myrow = df[findall(df.a.=="a"),:]
median time:      2.540 μs (0.00% GC)

So it seems filter is not slow.

On the larger df it is slow for me:

julia> @benchmark myrow = filter(row -> row[:a] == "a" , df)
BenchmarkTools.Trial: 
  memory estimate:  3.82 MiB
  allocs estimate:  199516
  --------------
  minimum time:     12.073 ms (0.00% GC)
  median time:      12.239 ms (0.00% GC)
  mean time:        12.414 ms (0.90% GC)
  maximum time:     14.290 ms (9.28% GC)
  --------------
  samples:          403
  evals/sample:     1

julia> @benchmark myrow = df[findall(df.a.=="a"),:]
BenchmarkTools.Trial: 
  memory estimate:  792.34 KiB
  allocs estimate:  23
  --------------
  minimum time:     953.499 μs (0.00% GC)
  median time:      967.685 μs (0.00% GC)
  mean time:        987.381 μs (1.53% GC)
  maximum time:     2.213 ms (49.67% GC)
  --------------
  samples:          5047
  evals/sample:     1

You are right. For large dataframes findall is ~10x faster:

julia> @benchmark myrow = filter(row -> row[:a] == "a" , df)
BenchmarkTools.Trial:
  memory estimate:  3.82 MiB
  allocs estimate:  199516
  --------------
  minimum time:     13.712 ms (0.00% GC)
  median time:      17.261 ms (0.00% GC)
  mean time:        18.321 ms (3.79% GC)
  maximum time:     62.593 ms (56.82% GC)
  --------------
  samples:          274
  evals/sample:     1

julia> @benchmark myrow = df[findall(df.a.=="a"),:]
BenchmarkTools.Trial:
  memory estimate:  799.47 KiB
  allocs estimate:  23
  --------------
  minimum time:     1.323 ms (0.00% GC)
  median time:      1.952 ms (0.00% GC)
  mean time:        2.101 ms (4.65% GC)
  maximum time:     30.440 ms (91.51% GC)
  --------------
  samples:          2366
  evals/sample:     1

… and @elapsed is acting strange
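
One plausible explanation for the odd @elapsed numbers (an assumption, nobody in the thread confirms it): @elapsed times a single run, including any compilation that run triggers. Every `row -> ...` literal typed at the REPL is a brand-new function type, so filter has to be compiled again on each call. A rough sketch of how to check this, using a named predicate `is_a` (a hypothetical name) so that only the first call pays for compilation:

```julia
using DataFrames

n = 100_000
df = DataFrame(a = rand(["a", "b", "c"], n), b = rand([1, 2, 3], n))

# A named predicate has one fixed type, so filter is compiled for it only once.
is_a(row) = row[:a] == "a"

t_first  = @elapsed filter(is_a, df)   # first call: includes compilation
t_second = @elapsed filter(is_a, df)   # second call: run time only

# A fresh anonymous function is a new type, so this call compiles again.
t_anon = @elapsed filter(row -> row[:a] == "a", df)

println((t_first = t_first, t_second = t_second, t_anon = t_anon))
```

If this explanation holds, `t_second` should be in the same range as the BenchmarkTools figures, while `t_first` and `t_anon` should be noticeably larger.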

The filter version is not type-stable: the compiler cannot infer the type of row[:a], so on each iteration it has to look up the type at run time and dispatch to the right code. The type-stable version is

myrow = filter(:a => (x -> x == "a") , df)

or shorter,

myrow = filter(:a => ==("a") , df)
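
To illustrate why naming the column helps, here is a minimal sketch in plain Julia, without DataFrames. It assumes a toy table, `cols`, that hides the column element types from the compiler, roughly as a DataFrame does. The names `rowwise_mask`, `barrier_mask` and `inner_mask` are hypothetical and are not DataFrames functions; this shows the general function-barrier idea, not the package's actual implementation.

```julia
# Toy stand-in for a DataFrame: the compiler only knows that each column is
# some AbstractVector, not what its element type is.
cols = Dict{Symbol,AbstractVector}(
    :a => ["a", "b", "a", "c"],
    :b => [1, 2, 3, 4],
)

# Row-wise style: the column is looked up inside the loop, so the type of
# cols[:a][i] is unknown at compile time and == is dispatched dynamically.
function rowwise_mask(cols, target)
    n = length(cols[:a])
    mask = Vector{Bool}(undef, n)
    for i in 1:n
        mask[i] = cols[:a][i] == target
    end
    return mask
end

# Function-barrier style: the column is fetched once and handed to an inner
# function, which Julia compiles for the concrete type Vector{String}.
barrier_mask(cols, target) = inner_mask(cols[:a], target)
inner_mask(col, target) = map(x -> x == target, col)

@assert rowwise_mask(cols, "a") == barrier_mask(cols, "a") == [true, false, true, false]
```

Both functions return the same mask; the difference is only in how much the compiler knows inside the hot loop.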

See "Why DataFrame is not type stable and when it matters" on Bogumił Kamiński's blog for more on type stability with DataFrames.
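
As a quick sanity check that the three forms discussed in this thread select the same rows, here is a small sketch on a four-row frame (the data is made up for illustration):

```julia
using DataFrames

df = DataFrame(a = ["a", "b", "c", "a"], b = [1, 2, 3, 4])

by_row   = filter(row -> row[:a] == "a", df)   # row-wise predicate
by_col   = filter(:a => ==("a"), df)           # predicate receives values of :a
by_index = df[findall(df.a .== "a"), :]        # explicit index lookup

# All three approaches keep rows 1 and 4.
@assert by_row == by_col == by_index
@assert by_col.b == [1, 4]
```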
