I often want to filter a large Dataframe by checking whether the values of a column are elements of another large vector.
Naively broadcasting in
across the first vector is very slow:
julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
2.274879 seconds (12 allocations: 1.542 MiB)
As a baseline, kdb+/q can do the same 2000x faster for 10^5 and scales linearly with n
(I interrupted Julia after it took over several minutes for 10^6):
q)\t (n?n)in n?n:prd 5#10
1
q)\t (n?n)in n?n:prd 6#10
16
(the time results are 1 and 16 milliseconds)
Obviously there’s a more efficient algorithm. Does anyone know if Julia has it implemented somewhere?
Just make it a Set
:
julia> n = 10^5; @time in.(rand(1:n, n), Ref(rand(1:n, n)));
2.439009 seconds (12 allocations: 1.542 MiB)
julia> n = 10^5; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
0.007638 seconds (20 allocations: 2.668 MiB)
julia> n = 10^6; @time in.(rand(1:n, n), Ref(Set(rand(1:n, n))));
0.089985 seconds (21 allocations: 24.383 MiB)
There’s been an issue about doing this automatically, but it is slower for small n
and the extra upfront work may be unexpected.
5 Likes
For DataFrames, you might have better luck with using join
but I’m not sure that’s been optimized very much. Or, you can use a O(1) lookup structure like a Set
.
julia> @btime in.($(rand(1:n, n)), $(Ref(Set(rand(1:n, n)))))
1.997 ms (3 allocations: 16.59 KiB)
(note that this doesn’t time the creation of the Set
, just the lookup bit).
Thanks! That indeed gets me within 3x the performance of kdb+/q for 10^5 and 10^6.
If you are using the latest dataframes you can probably get a speed-up from using the new filtering syntax
filter(:var => t -> t in Set(x), df)