Fastest way to filter when right hand sinde of `in` is large

I’m facing a problem where I would like to filter a dataset by using only those IDs that occur in another vector (that is around 100,000 elements long). If I just use in that takes forever as, I assume it checks every element in that vector. Is there a way to make this lookup faster similar to a Dict?

Small example

using Random, StatsBase
Random.seed!(42);
ids = [randstring(10) for _ in 1:100_000];
ids_use = sample(ids, 10_000, replace=false);
julia> @time filter(id -> id ∈ ids_use, ids)
  6.725424 seconds (10.95 k allocations: 1.281 MiB)

Thanks!

Use a set instead of an array for the right hand side.
Membership check in a set scales with O(1), whereas an array it scales with O(n).

7 Likes

Thank you this is the solution!

For completeness:

ids_set = Set(ids_use)
julia> @time filter(id -> id ∈ ids_set, ids)
  0.032714 seconds (10.95 k allocations: 1.282 MiB)

I also find the implementation of Set interesting

struct Set{T} <: AbstractSet{T}
    dict::Dict{T,Nothing}

    Set{T}() where {T} = new(Dict{T,Nothing}())
    Set{T}(s::Set{T}) where {T} = new(Dict{T,Nothing}(s.dict))
end

Explains why it is as fast as a Dict :wink:

5 Likes