I’m facing a problem where I would like to filter a dataset, keeping only those IDs that occur in another vector (around 100,000 elements long). If I just use in (∈), that takes forever because, I assume, it checks every element of that vector for each lookup. Is there a way to make this lookup faster, similar to a Dict?
Small example:
using Random, StatsBase
Random.seed!(42);
ids = [randstring(10) for _ in 1:100_000];
ids_use = sample(ids, 10_000, replace=false);
julia> @time filter(id -> id ∈ ids_use, ids)
6.725424 seconds (10.95 k allocations: 1.281 MiB)
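For comparison, here is a minimal sketch of what a Dict-like lookup could look like, assuming it is acceptable to build a Set from the lookup vector once up front. The names ids_use_set and filtered are made up for illustration, and shuffle is used in place of StatsBase's sample only to keep the snippet free of extra packages. I have not included timings, since they will depend on the machine.

```julia
using Random

Random.seed!(42)
ids = [randstring(10) for _ in 1:100_000]

# Pick 10_000 distinct positions (stand-in for `sample(ids, 10_000, replace=false)`).
ids_use = shuffle(ids)[1:10_000]

# Build a hash-based Set once. Membership tests on a Set are O(1) on average,
# whereas `in` on a Vector scans the elements one by one.
ids_use_set = Set(ids_use)

# `in(ids_use_set)` returns a function equivalent to `id -> id in ids_use_set`.
filtered = filter(in(ids_use_set), ids)
```

The one-time cost is constructing the Set; after that, each of the 100,000 checks is a hash lookup rather than a scan over up to 10,000 strings.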