# Best data structure for fast unions of large sets of integers

In the SparseConnectivityTracer.jl project, @hill and I need to efficiencly compute unions of non-disjoint sets of integers, which can be very large (thousands, even millions). The default `Set{Int}` is not bad, but I’m wondering if there’s a better data structure out there, perhaps in DataStructures.jl (ping @oxinabox).
My intuition is that hash sets are mainly optimized for checking `x in s` in O(1) amortized time. In our application, we never check `x in s`, but we do lots and lots of `union(s1, s2)`. So maybe a sorted `Vector{Int}` would be faster? Any other clues from CS-minded people?

3 Likes

No answers from me, but I think for a proper nerdsnipe you’ll want to set up a little toy problem that runs in ~1s and is close to your real-life workload.

5 Likes

I had to do something similar and ended up using two sorted lists. (I couldn’t find a perfectly suiting datatype back then.)

Just iterate through them in parallel which means you can find the union in O(n+m) time (+ the sorting runtime).

1 Like

A `DataStructures.SortedVector` might be a suitable type for efficiently creating and working with such sorted lists.
(IIRC, under the hood it is using fancy array based tree logic)

2 Likes

Mh, any particular reason why a more minimal example doesn’t apply? Like:

``````N, M = 100_000, 200_000
X = 1:1_000_000
A, B = rand(X, N), rand(X, M)
union(A,B)
``````

It’s a bit difficult to track back what exactly needs to be optimized in `TracerSparsityDetector`

1 Like

That minimal example does essentially what I want indeed. Gonna delete the full use case

`BitSet` comes to mind. If the sets are reasonably dense, I’m not sure it’s possible to do much better.

2 Likes

The sets are gonna be rather sparse in practice, so `Set` might prevail for really large sizes.
Here’s a benchmarking code parametrized by set type:

``````using BenchmarkTools

function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
@btime union(A, B) setup = ( #
A = \$S(rand(1:\$universe_size, \$set_size));
B = \$S(rand(1:\$universe_size, \$set_size))
)
return nothing
end
``````

And some results on my machine:

``````julia> benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^3)
55.236 μs (11 allocations: 54.87 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
11.826 μs (3 allocations: 121.75 KiB)
``````
1 Like

Where do you find it in DataStructures.jl? The only mention I see is an unmerged PR

1 Like

Because it is fast and easy i did the timing for the DataStructures types

``````julia> benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^3)
21.479 μs (11 allocations: 54.27 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
8.531 μs (7 allocations: 387.44 KiB)

julia> benchmark_settype(Vector{UInt}; universe_size=10^6, set_size=10^3)
26.179 μs (30 allocations: 102.38 KiB)

julia> benchmark_settype(DataStructures.SortedSet{UInt}; universe_size=10^6, set_size=10^3)
131.859 μs (35 allocations: 237.55 KiB)

julia> benchmark_settype(DataStructures.DisjointSets{UInt}; universe_size=10^6, set_size=10^3)
15.380 μs (11 allocations: 54.27 KiB)

julia> benchmark_settype(DataStructures.IntSet; universe_size=10^6, set_size=10^3)
19.959 μs (10 allocations: 183.72 KiB)
``````

Its gonna be hard to bet `BitSet` imo.
Becuase bitwise-or is a very fast operation.
And you are only talking about millions. which means only tens of thousands of or operations
Maybe you can do something like a bitset but with an extra indexing layer on top which tracks which of the underlaying UInts in the bitset are all zero, so you can skip unioning those

3 Likes

It seems to really depend on the sparsity.
The previous benchmarks were with 10^{-3} sparsity. If instead I do 10^{-4} the leaderboard changes:

``````julia> benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^2)
4.742 μs (10 allocations: 7.46 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^2)
11.279 μs (3 allocations: 116.19 KiB)
``````
2 Likes

To add nuance to this problem: the set types are allowed to know the range of values `1:universe_size` they could contain.

1 Like

Do you need it to be perfectly accurate?
If not Bloomfilter like techneques could speed it up a lot.
In proportion to how much inaccurasy you are willing to accept

1 Like

I don’t necessarily need full accuracy. My code would be correct with `C = union_overestimate(A, B)` which returns a larger set (or a set with duplicates), but I cannot afford to miss some elements from either `A` or `B`

Here is the strategy that I mentioned. It doesn’t seem to be better than the alternatives. I’m just posting for reference.

One question is maybe if sorting is allowed before the `union` operation or if it counts into the runtime?

The proposed strategy yields these timings

``````  # universe_size = 10^6, set_size = 10^2
Set{UInt}:                2.498 μs (10 allocations: 7.46 KiB)
BitSet:                   6.379 μs (3 allocations: 115.88 KiB)
SortedVector{UInt}:       1.525 μs (10 allocations: 7.95 KiB)
NonPreSortedVector{UInt}: 1.746 μs (10 allocations: 8.19 KiB)

# universe_size = 10^6, set_size = 10^3
Set{UInt}:                43.382 μs (11 allocations: 54.87 KiB)
BitSet:                   6.429 μs (3 allocations: 121.62 KiB)
SortedVector{UInt}:       16.411 μs (12 allocations: 54.33 KiB)
NonPreSortedVector{UInt}: 24.937 μs (16 allocations: 72.33 KiB)
``````

As you wrote @gdalle, it depends a lot on the size…

Complete code

Sorry for the ugly code, it’s just a prototype.

``````
using BenchmarkTools
function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
@btime union(A, B) setup = ( #
A = \$S(rand(1:\$universe_size, \$set_size));
B = \$S(rand(1:\$universe_size, \$set_size))
)
return nothing
end

function quick_union(A, B)
a, b = A[1], B[1]
iA, iB = 1, 1
C = union(a,b)
sort!(C)
new_a = a
last_a = a

new_b = b
last_b = b

@inbounds while true
last_four = (new_a, last_a, new_b, last_b)

if a <= b && iA < length(A)
iA += 1
a = A[iA]
if !(a in last_four)
push!(C, a)
last_a, new_a = new_a, a
end
elseif iB < length(B)
iB += 1
b = B[iB]
if !(b in last_four)
push!(C, b)
last_b, new_b = new_b, b
end
else
break
end
end

return C
end

struct SortedVector{T}
data::T

function SortedVector{T}(x::Vector) where {T}
new{Vector{T}}(convert.(T,sort(x)))
end
end
Base.union(A::SortedVector, B::SortedVector) = quick_union(A,B)
Base.getindex(A::SortedVector, i) = A.data[i]
Base.length(A::SortedVector) = length(A.data)

struct NonPreSortedVector{T}
data::T
function NonPreSortedVector{T}(x::Vector) where {T}
new{Vector{T}}(convert.(T,x))
end
end
function Base.union(A::NonPreSortedVector, B::NonPreSortedVector)
sort!(A.data)
sort!(B.data)
quick_union(A,B)
end

Base.getindex(A::NonPreSortedVector, i) = A.data[i]
Base.length(A::NonPreSortedVector) = length(A.data)

benchmark_settype(Set{UInt}; universe_size = 10^6, set_size = 10^2)
benchmark_settype(BitSet; universe_size=10^6, set_size=10^2)
benchmark_settype(SortedVector{UInt}; universe_size = 10^6, set_size = 10^2)
benchmark_settype(NonPreSortedVector{UInt}; universe_size = 10^6, set_size = 10^2)

benchmark_settype(Set{UInt}; universe_size = 10^6, set_size = 10^3)
benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
benchmark_settype(SortedVector{UInt}; universe_size = 10^6, set_size = 10^3)
benchmark_settype(NonPreSortedVector{UInt}; universe_size = 10^6, set_size = 10^3)
``````

The trivial way to get speed-up by allowing false positives,
is to store in your bitsets the number modulo some largish number like 10^4 or 10^5.

Then you expand it in post to include all multiples of that number + the actual values in the bitset, are under your `universe_size`

I get

``````julia> benchmark_settype(Set{Int}; universe_size=10^6, set_size=10^3)
44.971 μs (11 allocations: 54.27 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
7.338 μs (5 allocations: 121.66 KiB)

julia> benchmark_settype(MyIntSet{Int32}; universe_size=10^6, set_size=10^3)
3.764 μs (3 allocations: 7.88 KiB)

julia> benchmark_settype(MyIntSet{Int}; universe_size=10^6, set_size=10^3)
3.790 μs (3 allocations: 15.69 KiB)

``````

with the obvious sorted-vector handrolled implementation (keep in mind that it needs to be branchfree / ifelse / CMOV instead of branchy if/else. Otherwise make sure you benchmark right, sneaky branch history table remembers between benchmark loops). That assumes that you can keep working with sorted vectors – i.e. you never need random inserts.

``````julia> struct MyIntSet{T}
n::Nothing
v::Vector{T}
end

julia> function MyIntSet{T}(v::Vector{S}) where {T,S}
if T== S
MyIntSet{T}(nothing, sort!(v))
else
MyIntSet{T}(nothing, sort!(convert.(T, v)))
end
end

julia> function Base.union(l::MyIntSet{T}, r::MyIntSet{T}) where T
left = l.v
right = r.v
res = Vector{T}(undef, length(left) + length(right))
li = 1
ri = 1
oi = 1
until = min(length(left), length(right))
@inbounds while li <= until && ri <= until
litem = left[li]
ritem = right[ri]
ls = litem <= ritem
rs = litem >= ritem
res[oi] = ifelse(ls, litem, ritem)
oi += 1
li = ifelse(ls, li+1, li)
ri = ifelse(rs, ri+1, ri)
end
@inbounds for i=li:length(left)
res[oi] = left[i]
oi += 1
end
@inbounds for i=ri:length(right)
res[oi] = right[i]
oi += 1
end
resize!(res, oi-1)
MyIntSet(nothing, res)
end
``````
3 Likes

@foobar_lv2 Looks much nicer/faster than my code. So, but it doesn’t remove duplicates at all, e.g. it seems to be essentially a variant of `mergesort(a,b)`?

It assumes that left / right have no duplicates themselves. But if `left[i] == right[j]`, then `ls` and `rs` are both true, so both `li` and `ri` get incremented, while `oi` gets incremented only once. So the output contains no duplicates either.

But you’re right, my constructor was wrong. It must be on the lines of

``````julia> function MyIntSet{T}(v::Vector{S}) where {T,S}
if T== S
MyIntSet{T}(nothing, unique!(sort!(v)))
else
MyIntSet{T}(nothing, unique!(sort!(convert.(T, v))))
end
end
``````

where you of course should use a hand-rolled `unique!` that uses sortedness.

PS. You would have regular mergesort-merge without mutual-duplicate removal if you replaced `rs = litem >= ritem` by `rs = litem < ritem` or by `!ls`.

2 Likes

To reiterate a previously suggested idea:

The method you describe is basically a simplified Bloom filter.

Bloom filters are a cute data-structure and are described in many places. Since `union` of Bloom filters is simply ORing them, and that would take O(size of filter), you have a tradeoff between accuracy and speed.

IIRC there are some Bloom filter implementation already in Julia.
UPDATE: Probably.jl seems to include an implementation.