Fast in.(x, Ref(y))

I often want to filter a large Dataframe by checking whether the values of a column are elements of another large vector.

Naively broadcasting in across the first vector is very slow:

julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.274879 seconds (12 allocations: 1.542 MiB)

As a baseline, kdb+/q can do the same 2000x faster for 10^5 and scales linearly with n (I interrupted Julia after it took over several minutes for 10^6):

q)\t (n?n)in n?n:prd 5#10
1
q)\t (n?n)in n?n:prd 6#10
16

(the time results are 1 and 16 milliseconds)

Obviously there’s a more efficient algorithm. Does anyone know if Julia has it implemented somewhere?

Just make it a Set:

julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.439009 seconds (12 allocations: 1.542 MiB)

julia> n = 10^5; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
  0.007638 seconds (20 allocations: 2.668 MiB)

julia> n = 10^6; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
  0.089985 seconds (21 allocations: 24.383 MiB)

There’s been an issue about doing this automatically, but it is slower for small n and the extra upfront work may be unexpected.

For DataFrames, you might have better luck with using join but I’m not sure that’s been optimized very much. Or, you can use a O(1) lookup structure like a Set.

julia> @btime in.($(rand(1:n, n)), $(Ref(Set(rand(1:n, n)))))
  1.997 ms (3 allocations: 16.59 KiB)

(note that this doesn’t time the creation of the Set, just the lookup bit).

Thanks! That indeed gets me within 3x the performance of kdb+/q for 10^5 and 10^6.

If you are using the latest dataframes you can probably get a speed-up from using the new filtering syntax

filter(:var => t -> t in Set(x), df)