Fast in.(x, Ref(y))

robsmith11 · May 18, 2020, 3:14pm

I often want to filter a large Dataframe by checking whether the values of a column are elements of another large vector.

Naively broadcasting in across the first vector is very slow:

julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.274879 seconds (12 allocations: 1.542 MiB)

As a baseline, kdb+/q can do the same 2000x faster for 10^5 and scales linearly with n (I interrupted Julia after it took over several minutes for 10^6):

q)\t (n?n)in n?n:prd 5#10
1
q)\t (n?n)in n?n:prd 6#10
16

(the time results are 1 and 16 milliseconds)

Obviously there’s a more efficient algorithm. Does anyone know if Julia has it implemented somewhere?

mbauman · May 18, 2020, 3:20pm

Just make it a Set:

julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.439009 seconds (12 allocations: 1.542 MiB)

julia> n = 10^5; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
  0.007638 seconds (20 allocations: 2.668 MiB)

julia> n = 10^6; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
  0.089985 seconds (21 allocations: 24.383 MiB)

There’s been an issue about doing this automatically, but it is slower for small n and the extra upfront work may be unexpected.

dave.f.kleinschmidt · May 18, 2020, 3:23pm

For DataFrames, you might have better luck with using join but I’m not sure that’s been optimized very much. Or, you can use a O(1) lookup structure like a Set.

julia> @btime in.($(rand(1:n, n)), $(Ref(Set(rand(1:n, n)))))
  1.997 ms (3 allocations: 16.59 KiB)

(note that this doesn’t time the creation of the Set, just the lookup bit).

robsmith11 · May 18, 2020, 3:23pm

Thanks! That indeed gets me within 3x the performance of kdb+/q for 10^5 and 10^6.

pdeffebach · May 18, 2020, 3:26pm

If you are using the latest dataframes you can probably get a speed-up from using the new filtering syntax

filter(:var => t -> t in Set(x), df)

Topic		Replies	Views
In function runs really slow Data dataframes	3	482	March 29, 2022
Is there a faster way to filter in dataframes New to Julia dataframes , namedtuple	9	1293	August 29, 2021
Julia's in.() seems slow compared to R's %in% Performance	40	3018	May 9, 2019
What is Julia's equivalent of R's %in% New to Julia	9	2840	November 21, 2019
Pandas's fast isin() equivalent in Julia DataFrame or IndexedTable or anything else Data question	5	2985	October 19, 2019

Fast in.(x, Ref(y))

Related topics