 # Fast in.(x, Ref(y))

I often want to filter a large Dataframe by checking whether the values of a column are elements of another large vector.

Naively broadcasting `in` across the first vector is very slow:

``````julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
2.274879 seconds (12 allocations: 1.542 MiB)
``````

As a baseline, kdb+/q can do the same 2000x faster for 10^5 and scales linearly with `n` (I interrupted Julia after it took over several minutes for 10^6):

``````q)\t (n?n)in n?n:prd 5#10
1
q)\t (n?n)in n?n:prd 6#10
16
``````

(the time results are 1 and 16 milliseconds)

Obviously there’s a more efficient algorithm. Does anyone know if Julia has it implemented somewhere?

Just make it a `Set`:

``````julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
2.439009 seconds (12 allocations: 1.542 MiB)

julia> n = 10^5; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
0.007638 seconds (20 allocations: 2.668 MiB)

julia> n = 10^6; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
0.089985 seconds (21 allocations: 24.383 MiB)
``````

There’s been an issue about doing this automatically, but it is slower for small `n` and the extra upfront work may be unexpected.

5 Likes

For DataFrames, you might have better luck with using `join` but I’m not sure that’s been optimized very much. Or, you can use a O(1) lookup structure like a `Set`.

``````julia> @btime in.(\$(rand(1:n, n)), \$(Ref(Set(rand(1:n, n)))))
1.997 ms (3 allocations: 16.59 KiB)
``````

(note that this doesn’t time the creation of the `Set`, just the lookup bit).

Thanks! That indeed gets me within 3x the performance of kdb+/q for 10^5 and 10^6.

If you are using the latest dataframes you can probably get a speed-up from using the new filtering syntax

``````filter(:var => t -> t in Set(x), df)
``````