Fast isin() operation in DataFrame is very important since it is used a lot but I can’t find equivalently efficient solution in Julia. Besides I wonder how Pandas isin() could be so fast. It should take O(m x n) at best.
Well, the thing is in Python, you need someone to write the isin for you but in Julia if the isin isn’t available you can be confident that you can write your own. Here is one
The pandas API is huge! Which will take a long time to replicate, but you don’t need them if you have some knowledge of algorithms etc.
using DataFrames
using SortingAlgorithms
df = DataFrame(x=rand(1:100, 10^6), y=rand(50:149, 10^6))
isinsorted(a, x) = begin
pos = searchsortedfirst(a, x)
(pos <= length(a)) && (a[pos] == x)
end
isin1(x, y) = begin
sorted_y = sort(y, alg=RadixSort)
isinsorted.(Ref(sorted_y), x)
end
using BenchmarkTools
x = df.x
y = df.y
@benchmark isin1($x, $y)
@time df2 = df[isin1(df.x, df.y), :]
which should be pretty close, and this isn’t the most optimised. As you can see, you can write your own of isin using every little code
df[in.(df.x, [Set(df.y)]), :] is already much faster (BTW, df[in.(df.x, Ref(Set(df.y)), :] is more idiomatic). There’s also the more convoluted df[.!isnothing.(indexin(df.x, df.y)), :].
I think there have been discussions about making in.(...) use more efficient algorithms by default, but it’s hard to know in advance which approach is faster (e.g. x or y could be very short).