Pandas's fast isin() equivalent in Julia DataFrame or IndexedTable or anything else

Sijun · October 19, 2019, 10:04am

Python Pandas DataFrame’s isin() is really fast. for example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randint(1,100, size=(10**6,)),
                "y": np.random.randint(50,149, size=(10**6,))})


%timeit df[df.x.isin(df.y)]

42 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But with Julia DataFrame,

using DataFrames

df = DataFrame(x=rand(1:100, 10^6), y=rand(50:149, 10^6))

@time df[in.(df.x, [df.y]), :] # @time at the second time

217.819652 seconds (37 allocations: 11.795 MiB, 0.00% gc time)

Fast isin() operation in DataFrame is very important since it is used a lot but I can’t find equivalently efficient solution in Julia. Besides I wonder how Pandas isin() could be so fast. It should take O(m x n) at best.

xiaodai · October 19, 2019, 10:37am

Well, the thing is in Python, you need someone to write the isin for you but in Julia if the isin isn’t available you can be confident that you can write your own. Here is one

The pandas API is huge! Which will take a long time to replicate, but you don’t need them if you have some knowledge of algorithms etc.

using DataFrames
using SortingAlgorithms
df = DataFrame(x=rand(1:100, 10^6), y=rand(50:149, 10^6))

isinsorted(a, x) = begin
    pos = searchsortedfirst(a, x)
    (pos <= length(a)) && (a[pos] == x)
end


isin1(x, y) = begin
    sorted_y = sort(y, alg=RadixSort)
    isinsorted.(Ref(sorted_y), x)
end

using BenchmarkTools
x = df.x
y = df.y
@benchmark isin1($x, $y)

@time df2 = df[isin1(df.x, df.y), :]

which should be pretty close, and this isn’t the most optimised. As you can see, you can write your own of isin using every little code

nalimilan · October 19, 2019, 10:51am

df[in.(df.x, [Set(df.y)]), :] is already much faster (BTW, df[in.(df.x, Ref(Set(df.y)), :] is more idiomatic). There’s also the more convoluted df[.!isnothing.(indexin(df.x, df.y)), :].

I think there have been discussions about making in.(...) use more efficient algorithms by default, but it’s hard to know in advance which approach is faster (e.g. x or y could be very short).

xiaodai · October 19, 2019, 10:57am

I knew there was a faster way…

Sijun · October 19, 2019, 11:03am

@nalimilan, @xiaodai, I can’t thank you enough. It works very well!

xiaodai · October 19, 2019, 11:05am

If you really NEED speed, you can make sure that the dataframe is sorted by x so df.x is sorted, then you can write an even faster algorithm!

Topic		Replies	Views
DataFrame isin operation New to Julia dataframes	4	1104	December 9, 2021
Fast in.(x, Ref(y)) Performance	4	510	May 18, 2020
[DataFrames Question]: hash-based row indexing for DataFrames package Data question , suggestions	16	2233	October 16, 2019
In function runs really slow Data dataframes	3	481	March 29, 2022
Can this be made faster? Performance dataframes	5	541	March 19, 2022

Pandas's fast isin() equivalent in Julia DataFrame or IndexedTable or anything else

Related topics