I often want to filter a large DataFrame by checking whether the values of a column are elements of another large vector.
Broadcasting `in` across the first vector is very slow:
julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
  2.274879 seconds (12 allocations: 1.542 MiB)
As a baseline, kdb+/q does the same about 2000x faster for n = 10^5 and scales linearly with
n (I interrupted Julia after it had been running for several minutes at n = 10^6):
q)\t (n?n)in n?n:prd 5#10
1
q)\t (n?n)in n?n:prd 6#10
16
(The timings are 1 and 16 milliseconds, respectively.)
Obviously there’s a more efficient algorithm. Does anyone know if Julia has it implemented somewhere?
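For what it's worth, one candidate I would expect to scale roughly linearly is to turn the second vector into a `Set` before broadcasting, since `in` on a `Vector` is a linear scan while `in` on a `Set` is a hash lookup. This is only a minimal sketch under that assumption, not a claim about what kdb+/q does internally; the names `col`, `lookup`, and `lookup_set` are made up for illustration.

```julia
# Sketch: membership test against a Set instead of a Vector.
# Broadcasting `in` over a Vector costs O(n) per element, so O(n^2) overall;
# against a Set each lookup is a hash probe, so the total is roughly O(n).
n = 10^5
col    = rand(1:n, n)   # stands in for the DataFrame column (illustrative data)
lookup = rand(1:n, n)   # the vector to test membership against

lookup_set = Set(lookup)             # one-time build of the hash set
mask = in.(col, Ref(lookup_set))     # BitVector, one entry per element of col

filtered = col[mask]                 # keep only the elements found in lookup
```

With a DataFrame, the same `mask` could presumably be used for row indexing, e.g. `df[mask, :]` for a hypothetical `df` whose column was passed as `col`.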