Fastest way to filter when right hand side of `in` is large

danielw2904 · December 28, 2020, 6:18pm

I’m facing a problem where I would like to filter a dataset, keeping only those IDs that occur in another vector (that is around 100,000 elements long). If I just use `in`, it takes forever because, I assume, it checks every element in that vector. Is there a way to make this lookup faster, similar to a Dict?

Small example

using Random, StatsBase
Random.seed!(42);
ids = [randstring(10) for _ in 1:100_000];
ids_use = sample(ids, 10_000, replace=false);

julia> @time filter(id -> id ∈ ids_use, ids)
  6.725424 seconds (10.95 k allocations: 1.281 MiB)

Thanks!

lungben · December 28, 2020, 6:23pm

Use a set instead of an array for the right hand side.
A membership check in a set takes O(1) time on average, whereas in an array it takes O(n), because every element may have to be compared.

danielw2904 · December 28, 2020, 6:30pm

Thank you, this is the solution!

For completeness:

ids_set = Set(ids_use)

julia> @time filter(id -> id ∈ ids_set, ids)
  0.032714 seconds (10.95 k allocations: 1.282 MiB)
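
As a side note, the same filter can be sketched on a small example. The data below is made up for illustration and is not the benchmark above. The point is that the set is built once, outside the predicate, and `in(ids_set)` is the one-argument form of `in`, which returns a function equivalent to `id -> id in ids_set`.

```julia
# Stand-in data for illustration only (not the 100,000-element benchmark).
ids = ["a", "b", "c", "d", "e"]
ids_use = ["b", "d", "x"]

# Build the set once, outside the predicate, so the hashing cost is paid a single time.
ids_set = Set(ids_use)

# in(ids_set) returns a function that tests membership in ids_set.
kept = filter(in(ids_set), ids)
```

Constructing the `Set` inside the anonymous function instead would rebuild it for every element and probably lose most of the speedup.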

danielw2904 · December 28, 2020, 6:48pm

I also find the implementation of Set interesting:

struct Set{T} <: AbstractSet{T}
    dict::Dict{T,Nothing}

    Set{T}() where {T} = new(Dict{T,Nothing}())
    Set{T}(s::Set{T}) where {T} = new(Dict{T,Nothing}(s.dict))
end

This explains why it is as fast as a Dict: the elements are stored as the keys of a Dict whose values are all `nothing`.
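
To make the idea concrete, here is a minimal sketch of a set-like wrapper around a Dict. `TinySet` and `tinyset` are hypothetical names made up for this illustration; this is not the code in Base, only the same pattern of keys mapped to `nothing` with membership done via `haskey`.

```julia
# Hypothetical toy type for illustration; not the definition used in Base.
struct TinySet{T}
    dict::Dict{T,Nothing}
end

# Hypothetical helper: every element becomes a key mapped to `nothing`.
function tinyset(items)
    T = eltype(items)
    return TinySet{T}(Dict{T,Nothing}(x => nothing for x in items))
end

# Membership is a hash lookup on the dictionary keys.
Base.in(x, s::TinySet) = haskey(s.dict, x)

s = tinyset(["a", "b", "c"])
has_b = "b" in s
has_z = "z" in s
```

Here `has_b` should be `true` and `has_z` should be `false`, without scanning the elements one by one.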
