Pandas's fast isin() equivalent in Julia DataFrame or IndexedTable or anything else

Python Pandas DataFrame’s isin() is really fast. for example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randint(1,100, size=(10**6,)),
                "y": np.random.randint(50,149, size=(10**6,))})


%timeit df[df.x.isin(df.y)]

42 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But with Julia DataFrame,

using DataFrames

df = DataFrame(x=rand(1:100, 10^6), y=rand(50:149, 10^6))

@time df[in.(df.x, [df.y]), :] # @time at the second time

217.819652 seconds (37 allocations: 11.795 MiB, 0.00% gc time)

Fast isin() operation in DataFrame is very important since it is used a lot but I can’t find equivalently efficient solution in Julia. Besides I wonder how Pandas isin() could be so fast. It should take O(m x n) at best.

Well, the thing is in Python, you need someone to write the isin for you but in Julia if the isin isn’t available you can be confident that you can write your own. Here is one

The pandas API is huge! Which will take a long time to replicate, but you don’t need them if you have some knowledge of algorithms etc.

using DataFrames
using SortingAlgorithms
df = DataFrame(x=rand(1:100, 10^6), y=rand(50:149, 10^6))

isinsorted(a, x) = begin
    pos = searchsortedfirst(a, x)
    (pos <= length(a)) && (a[pos] == x)
end


isin1(x, y) = begin
    sorted_y = sort(y, alg=RadixSort)
    isinsorted.(Ref(sorted_y), x)
end

using BenchmarkTools
x = df.x
y = df.y
@benchmark isin1($x, $y)

@time df2 = df[isin1(df.x, df.y), :]

which should be pretty close, and this isn’t the most optimised. As you can see, you can write your own of isin using every little code

2 Likes

df[in.(df.x, [Set(df.y)]), :] is already much faster (BTW, df[in.(df.x, Ref(Set(df.y)), :] is more idiomatic). There’s also the more convoluted df[.!isnothing.(indexin(df.x, df.y)), :].

I think there have been discussions about making in.(...) use more efficient algorithms by default, but it’s hard to know in advance which approach is faster (e.g. x or y could be very short).

4 Likes

I knew there was a faster way…

@nalimilan, @xiaodai, I can’t thank you enough. It works very well! :slight_smile:

If you really NEED speed, you can make sure that the dataframe is sorted by x so df.x is sorted, then you can write an even faster algorithm!