Fast in.(x, Ref(y))

I often want to filter a large Dataframe by checking whether the values of a column are elements of another large vector.

Naively broadcasting in across the first vector is very slow:

julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.274879 seconds (12 allocations: 1.542 MiB)

As a baseline, kdb+/q can do the same 2000x faster for 10^5 and scales linearly with n (I interrupted Julia after it took over several minutes for 10^6):

q)\t (n?n)in n?n:prd 5#10
1
q)\t (n?n)in n?n:prd 6#10
16

(the time results are 1 and 16 milliseconds)

Obviously there’s a more efficient algorithm. Does anyone know if Julia has it implemented somewhere?

Just make it a Set:

julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.439009 seconds (12 allocations: 1.542 MiB)

julia> n = 10^5; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
  0.007638 seconds (20 allocations: 2.668 MiB)

julia> n = 10^6; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
  0.089985 seconds (21 allocations: 24.383 MiB)

There’s been an issue about doing this automatically, but it is slower for small n and the extra upfront work may be unexpected.

5 Likes

For DataFrames, you might have better luck with using join but I’m not sure that’s been optimized very much. Or, you can use a O(1) lookup structure like a Set.

julia> @btime in.($(rand(1:n, n)), $(Ref(Set(rand(1:n, n)))))
  1.997 ms (3 allocations: 16.59 KiB)

(note that this doesn’t time the creation of the Set, just the lookup bit).

Thanks! That indeed gets me within 3x the performance of kdb+/q for 10^5 and 10^6.

If you are using the latest dataframes you can probably get a speed-up from using the new filtering syntax

filter(:var => t -> t in Set(x), df)