Best data structure for fast unions of large sets of integers

Nerdsnipe alert

In the SparseConnectivityTracer.jl project, @hill and I need to efficiencly compute unions of non-disjoint sets of integers, which can be very large (thousands, even millions). The default Set{Int} is not bad, but I’m wondering if there’s a better data structure out there, perhaps in DataStructures.jl (ping @oxinabox).
My intuition is that hash sets are mainly optimized for checking x in s in O(1) amortized time. In our application, we never check x in s, but we do lots and lots of union(s1, s2). So maybe a sorted Vector{Int} would be faster? Any other clues from CS-minded people?

3 Likes

No answers from me, but I think for a proper nerdsnipe you’ll want to set up a little toy problem that runs in ~1s and is close to your real-life workload.

5 Likes

I had to do something similar and ended up using two sorted lists. (I couldn’t find a perfectly suiting datatype back then.)

Just iterate through them in parallel which means you can find the union in O(n+m) time (+ the sorting runtime).

1 Like

A DataStructures.SortedVector might be a suitable type for efficiently creating and working with such sorted lists.
(IIRC, under the hood it is using fancy array based tree logic)

2 Likes

Mh, any particular reason why a more minimal example doesn’t apply? Like:

N, M = 100_000, 200_000
X = 1:1_000_000
A, B = rand(X, N), rand(X, M)
union(A,B)

It’s a bit difficult to track back what exactly needs to be optimized in TracerSparsityDetector

1 Like

That minimal example does essentially what I want indeed. Gonna delete the full use case

BitSet comes to mind. If the sets are reasonably dense, I’m not sure it’s possible to do much better.

3 Likes

The sets are gonna be rather sparse in practice, so Set might prevail for really large sizes.
Here’s a benchmarking code parametrized by set type:

using BenchmarkTools

function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
    @btime union(A, B) setup = ( #
        A = $S(rand(1:$universe_size, $set_size));
        B = $S(rand(1:$universe_size, $set_size))
    )
    return nothing
end

And some results on my machine:

julia> benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^3)
  55.236 μs (11 allocations: 54.87 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
  11.826 μs (3 allocations: 121.75 KiB)
2 Likes

Where do you find it in DataStructures.jl? The only mention I see is an unmerged PR

1 Like

Because it is fast and easy i did the timing for the DataStructures types

julia> benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^3)
  21.479 μs (11 allocations: 54.27 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
  8.531 μs (7 allocations: 387.44 KiB)

julia> benchmark_settype(Vector{UInt}; universe_size=10^6, set_size=10^3)
  26.179 μs (30 allocations: 102.38 KiB)

julia> benchmark_settype(DataStructures.SortedSet{UInt}; universe_size=10^6, set_size=10^3)
  131.859 μs (35 allocations: 237.55 KiB)

julia> benchmark_settype(DataStructures.DisjointSets{UInt}; universe_size=10^6, set_size=10^3)
  15.380 μs (11 allocations: 54.27 KiB)

julia> benchmark_settype(DataStructures.IntSet; universe_size=10^6, set_size=10^3)
  19.959 μs (10 allocations: 183.72 KiB)

Its gonna be hard to bet BitSet imo.
Becuase bitwise-or is a very fast operation.
And you are only talking about millions. which means only tens of thousands of or operations
Maybe you can do something like a bitset but with an extra indexing layer on top which tracks which of the underlaying UInts in the bitset are all zero, so you can skip unioning those

3 Likes

It seems to really depend on the sparsity.
The previous benchmarks were with 10^{-3} sparsity. If instead I do 10^{-4} the leaderboard changes:

julia> benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^2)
  4.742 μs (10 allocations: 7.46 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^2)
  11.279 μs (3 allocations: 116.19 KiB)
2 Likes

To add nuance to this problem: the set types are allowed to know the range of values 1:universe_size they could contain.

1 Like

Do you need it to be perfectly accurate?
If not Bloomfilter like techneques could speed it up a lot.
In proportion to how much inaccurasy you are willing to accept

1 Like

I don’t necessarily need full accuracy. My code would be correct with C = union_overestimate(A, B) which returns a larger set (or a set with duplicates), but I cannot afford to miss some elements from either A or B

Here is the strategy that I mentioned. It doesn’t seem to be better than the alternatives. I’m just posting for reference.

One question is maybe if sorting is allowed before the union operation or if it counts into the runtime?

The proposed strategy yields these timings

  # universe_size = 10^6, set_size = 10^2
  Set{UInt}:                2.498 μs (10 allocations: 7.46 KiB)
  BitSet:                   6.379 μs (3 allocations: 115.88 KiB)
  SortedVector{UInt}:       1.525 μs (10 allocations: 7.95 KiB)
  NonPreSortedVector{UInt}: 1.746 μs (10 allocations: 8.19 KiB)

  # universe_size = 10^6, set_size = 10^3
  Set{UInt}:                43.382 μs (11 allocations: 54.87 KiB)
  BitSet:                   6.429 μs (3 allocations: 121.62 KiB)
  SortedVector{UInt}:       16.411 μs (12 allocations: 54.33 KiB)
  NonPreSortedVector{UInt}: 24.937 μs (16 allocations: 72.33 KiB)

As you wrote @gdalle, it depends a lot on the size…

Complete code

Sorry for the ugly code, it’s just a prototype.


using BenchmarkTools
function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
    @btime union(A, B) setup = ( #
        A = $S(rand(1:$universe_size, $set_size));
        B = $S(rand(1:$universe_size, $set_size))
    )
    return nothing
end

function quick_union(A, B)
    a, b = A[1], B[1]
    iA, iB = 1, 1
    C = union(a,b)
    sort!(C)
    new_a = a 
    last_a = a 

    new_b = b 
    last_b = b

    @inbounds while true
        last_four = (new_a, last_a, new_b, last_b)

        if a <= b && iA < length(A)
            iA += 1
            a = A[iA]
            if !(a in last_four)
                push!(C, a)
                last_a, new_a = new_a, a
            end
        elseif iB < length(B)
            iB += 1
            b = B[iB]
            if !(b in last_four)
                push!(C, b)
                last_b, new_b = new_b, b
            end
        else 
            break 
        end
    end

    return C 
end



struct SortedVector{T}
    data::T

    function SortedVector{T}(x::Vector) where {T}
        new{Vector{T}}(convert.(T,sort(x)))
    end
end
Base.union(A::SortedVector, B::SortedVector) = quick_union(A,B)
Base.getindex(A::SortedVector, i) = A.data[i]
Base.length(A::SortedVector) = length(A.data)

struct NonPreSortedVector{T}
    data::T
    function NonPreSortedVector{T}(x::Vector) where {T}
        new{Vector{T}}(convert.(T,x))
    end
end
function Base.union(A::NonPreSortedVector, B::NonPreSortedVector) 
    sort!(A.data)
    sort!(B.data)
    quick_union(A,B)
end

Base.getindex(A::NonPreSortedVector, i) = A.data[i]
Base.length(A::NonPreSortedVector) = length(A.data)


benchmark_settype(Set{UInt}; universe_size = 10^6, set_size = 10^2)
benchmark_settype(BitSet; universe_size=10^6, set_size=10^2)
benchmark_settype(SortedVector{UInt}; universe_size = 10^6, set_size = 10^2)
benchmark_settype(NonPreSortedVector{UInt}; universe_size = 10^6, set_size = 10^2)

benchmark_settype(Set{UInt}; universe_size = 10^6, set_size = 10^3)
benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
benchmark_settype(SortedVector{UInt}; universe_size = 10^6, set_size = 10^3)
benchmark_settype(NonPreSortedVector{UInt}; universe_size = 10^6, set_size = 10^3)

The trivial way to get speed-up by allowing false positives,
is to store in your bitsets the number modulo some largish number like 10^4 or 10^5.

Then you expand it in post to include all multiples of that number + the actual values in the bitset, are under your universe_size

I get

julia> benchmark_settype(Set{Int}; universe_size=10^6, set_size=10^3)
  44.971 μs (11 allocations: 54.27 KiB)

julia> benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
  7.338 μs (5 allocations: 121.66 KiB)

julia> benchmark_settype(MyIntSet{Int32}; universe_size=10^6, set_size=10^3)
  3.764 μs (3 allocations: 7.88 KiB)

julia> benchmark_settype(MyIntSet{Int}; universe_size=10^6, set_size=10^3)
  3.790 μs (3 allocations: 15.69 KiB)

with the obvious sorted-vector handrolled implementation (keep in mind that it needs to be branchfree / ifelse / CMOV instead of branchy if/else. Otherwise make sure you benchmark right, sneaky branch history table remembers between benchmark loops). That assumes that you can keep working with sorted vectors – i.e. you never need random inserts.

julia> struct MyIntSet{T}
       n::Nothing
       v::Vector{T}
       end

julia> function MyIntSet{T}(v::Vector{S}) where {T,S}
                    if T== S
              MyIntSet{T}(nothing, sort!(v))
              else
              MyIntSet{T}(nothing, sort!(convert.(T, v)))
              end
              end

julia> function Base.union(l::MyIntSet{T}, r::MyIntSet{T}) where T
       left = l.v
       right = r.v
       res = Vector{T}(undef, length(left) + length(right))
       li = 1
       ri = 1
       oi = 1
       until = min(length(left), length(right))
       @inbounds while li <= until && ri <= until
       litem = left[li]
       ritem = right[ri]
       ls = litem <= ritem
       rs = litem >= ritem
       res[oi] = ifelse(ls, litem, ritem)
       oi += 1
       li = ifelse(ls, li+1, li)
       ri = ifelse(rs, ri+1, ri)
       end
       @inbounds for i=li:length(left)
       res[oi] = left[i]
       oi += 1
       end
       @inbounds for i=ri:length(right)
       res[oi] = right[i]
       oi += 1
       end
       resize!(res, oi-1)
       MyIntSet(nothing, res)
       end
3 Likes

@foobar_lv2 Looks much nicer/faster than my code. So, but it doesn’t remove duplicates at all, e.g. it seems to be essentially a variant of mergesort(a,b)?

It assumes that left / right have no duplicates themselves. But if left[i] == right[j], then ls and rs are both true, so both li and ri get incremented, while oi gets incremented only once. So the output contains no duplicates either.

But you’re right, my constructor was wrong. It must be on the lines of

julia> function MyIntSet{T}(v::Vector{S}) where {T,S}
                    if T== S
              MyIntSet{T}(nothing, unique!(sort!(v)))
              else
              MyIntSet{T}(nothing, unique!(sort!(convert.(T, v))))
              end
              end

where you of course should use a hand-rolled unique! that uses sortedness.

PS. You would have regular mergesort-merge without mutual-duplicate removal if you replaced rs = litem >= ritem by rs = litem < ritem or by !ls.

2 Likes

To reiterate a previously suggested idea:

The method you describe is basically a simplified Bloom filter.

Bloom filters are a cute data-structure and are described in many places. Since union of Bloom filters is simply ORing them, and that would take O(size of filter), you have a tradeoff between accuracy and speed.

IIRC there are some Bloom filter implementation already in Julia.
UPDATE: Probably.jl seems to include an implementation.