In the SparseConnectivityTracer.jl project, @hill and I need to efficiently compute unions of non-disjoint sets of integers, which can be very large (thousands, even millions of elements). The default `Set{Int}` is not bad, but I’m wondering if there’s a better data structure out there, perhaps in DataStructures.jl (ping @oxinabox).
My intuition is that hash sets are mainly optimized for checking x in s in O(1) amortized time. In our application, we never check x in s, but we do lots and lots of union(s1, s2). So maybe a sorted Vector{Int} would be faster? Any other clues from CS-minded people?

No answers from me, but I think for a proper nerdsnipe you’ll want to set up a little toy problem that runs in ~1s and is close to your real-life workload.

A DataStructures.SortedVector might be a suitable type for efficiently creating and working with such sorted lists.
(IIRC, under the hood it uses fancy array-based tree logic.)

The sets are gonna be rather sparse in practice, so Set might prevail for really large sizes.
Here’s some benchmarking code parametrized by the set type:

```julia
using BenchmarkTools

function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
    @btime union(A, B) setup = ( #
        A = $S(rand(1:$universe_size, $set_size));
        B = $S(rand(1:$universe_size, $set_size))
    )
    return nothing
end
```

It’s gonna be hard to beat BitSet imo, because bitwise OR is a very fast operation.
And you are only talking about millions, which means only tens of thousands of OR operations.
Maybe you can do something like a bitset but with an extra indexing layer on top which tracks which of the underlying UInts in the bitset are all zero, so you can skip unioning those.
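That two-level idea could look something like the following sketch (the names `ChunkedBitSet` and `nonzero` are made up for illustration, not an existing package): a plain `Vector{UInt64}` of words plus a summary `BitVector` marking which words are nonzero, so the union loop can skip words that are zero on both sides.

```julia
# Hypothetical sketch of a bitset with a summary layer.
struct ChunkedBitSet
    chunks::Vector{UInt64}  # the actual bitset words
    nonzero::BitVector      # nonzero[i] == true iff chunks[i] != 0
end

ChunkedBitSet(n_chunks::Int) = ChunkedBitSet(zeros(UInt64, n_chunks), falses(n_chunks))

function Base.push!(s::ChunkedBitSet, x::Integer)
    c, b = divrem(x - 1, 64)          # which word, which bit
    s.chunks[c + 1] |= UInt64(1) << b
    s.nonzero[c + 1] = true
    return s
end

function Base.in(x::Integer, s::ChunkedBitSet)
    c, b = divrem(x - 1, 64)
    (s.chunks[c + 1] >> b) & 1 == 1
end

function Base.union(a::ChunkedBitSet, b::ChunkedBitSet)
    @assert length(a.chunks) == length(b.chunks)
    out = ChunkedBitSet(length(a.chunks))
    @inbounds for i in eachindex(a.chunks)
        # skip words that are zero on both sides
        if a.nonzero[i] | b.nonzero[i]
            out.chunks[i] = a.chunks[i] | b.chunks[i]
            out.nonzero[i] = true
        end
    end
    return out
end
```

Whether the extra branch pays off depends on how sparse the occupied words are; for densely occupied bitsets the plain OR loop should win.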

Do you need it to be perfectly accurate?
If not, Bloom-filter-like techniques could speed it up a lot, in proportion to how much inaccuracy you are willing to accept.

I don’t necessarily need full accuracy. My code would be correct with `C = union_overestimate(A, B)`, which returns a larger set (or a set with duplicates), but I cannot afford to miss any elements from either A or B.

As you wrote @gdalle, it depends a lot on the size…

Complete code

Sorry for the ugly code, it’s just a prototype.

```julia
using BenchmarkTools

function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
    @btime union(A, B) setup = ( #
        A = $S(rand(1:$universe_size, $set_size));
        B = $S(rand(1:$universe_size, $set_size))
    )
    return nothing
end

function quick_union(A, B)
    a, b = A[1], B[1]
    iA, iB = 1, 1
    C = union(a, b)
    sort!(C)
    new_a = a
    last_a = a
    new_b = b
    last_b = b
    @inbounds while true
        last_four = (new_a, last_a, new_b, last_b)
        if a <= b && iA < length(A)
            iA += 1
            a = A[iA]
            if !(a in last_four)
                push!(C, a)
                last_a, new_a = new_a, a
            end
        elseif iB < length(B)
            iB += 1
            b = B[iB]
            if !(b in last_four)
                push!(C, b)
                last_b, new_b = new_b, b
            end
        else
            break
        end
    end
    return C
end

struct SortedVector{T}
    data::T
    function SortedVector{T}(x::Vector) where {T}
        new{Vector{T}}(convert.(T, sort(x)))
    end
end

Base.union(A::SortedVector, B::SortedVector) = quick_union(A, B)
Base.getindex(A::SortedVector, i) = A.data[i]
Base.length(A::SortedVector) = length(A.data)

struct NonPreSortedVector{T}
    data::T
    function NonPreSortedVector{T}(x::Vector) where {T}
        new{Vector{T}}(convert.(T, x))
    end
end

function Base.union(A::NonPreSortedVector, B::NonPreSortedVector)
    sort!(A.data)
    sort!(B.data)
    quick_union(A, B)
end

Base.getindex(A::NonPreSortedVector, i) = A.data[i]
Base.length(A::NonPreSortedVector) = length(A.data)

benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^2)
benchmark_settype(BitSet; universe_size=10^6, set_size=10^2)
benchmark_settype(SortedVector{UInt}; universe_size=10^6, set_size=10^2)
benchmark_settype(NonPreSortedVector{UInt}; universe_size=10^6, set_size=10^2)
benchmark_settype(Set{UInt}; universe_size=10^6, set_size=10^3)
benchmark_settype(BitSet; universe_size=10^6, set_size=10^3)
benchmark_settype(SortedVector{UInt}; universe_size=10^6, set_size=10^3)
benchmark_settype(NonPreSortedVector{UInt}; universe_size=10^6, set_size=10^3)
```

with the obvious hand-rolled sorted-vector implementation (keep in mind that it needs to be branch-free, i.e. `ifelse`/CMOV instead of branchy `if`/`else` — otherwise make sure you benchmark carefully, since the sneaky branch history table remembers across benchmark loop iterations). That assumes you can keep working with sorted vectors, i.e. you never need random inserts.

```julia
julia> struct MyIntSet{T}
           n::Nothing
           v::Vector{T}
       end

julia> function MyIntSet{T}(v::Vector{S}) where {T,S}
           if T == S
               MyIntSet{T}(nothing, sort!(v))
           else
               MyIntSet{T}(nothing, sort!(convert.(T, v)))
           end
       end

julia> function Base.union(l::MyIntSet{T}, r::MyIntSet{T}) where {T}
           left = l.v
           right = r.v
           res = Vector{T}(undef, length(left) + length(right))
           li = 1
           ri = 1
           oi = 1
           # merge while both inputs have elements left; note that stopping at
           # min(length(left), length(right)) for both indices would end the
           # merge too early when the lengths differ
           @inbounds while li <= length(left) && ri <= length(right)
               litem = left[li]
               ritem = right[ri]
               ls = litem <= ritem
               rs = litem >= ritem
               res[oi] = ifelse(ls, litem, ritem)
               oi += 1
               li = ifelse(ls, li + 1, li)
               ri = ifelse(rs, ri + 1, ri)
           end
           @inbounds for i in li:length(left)
               res[oi] = left[i]
               oi += 1
           end
           @inbounds for i in ri:length(right)
               res[oi] = right[i]
               oi += 1
           end
           resize!(res, oi - 1)
           MyIntSet(nothing, res)
       end
```

@foobar_lv2 Looks much nicer/faster than my code. But it doesn’t remove duplicates at all, does it? It seems to be essentially a variant of `mergesort(a, b)`?

It assumes that left / right have no duplicates themselves. But if left[i] == right[j], then ls and rs are both true, so both li and ri get incremented, while oi gets incremented only once. So the output contains no duplicates either.

But you’re right, my constructor was wrong. It must be along the lines of

```julia
julia> function MyIntSet{T}(v::Vector{S}) where {T,S}
           if T == S
               MyIntSet{T}(nothing, unique!(sort!(v)))
           else
               MyIntSet{T}(nothing, unique!(sort!(convert.(T, v))))
           end
       end
```

where you of course should use a hand-rolled unique! that uses sortedness.
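Such a hand-rolled `unique!` might look like the following sketch (`sorted_unique!` is a made-up name; I believe `Base.unique!` already has a fast path for sorted input, but writing it out makes the cost explicit): a single pass with a write cursor, O(n) with no hashing, relying on duplicates being adjacent in a sorted vector.

```julia
# Remove duplicates from an already-sorted vector in place.
# Assumes `issorted(v)`; duplicates are then adjacent, so one
# linear pass with a write cursor suffices.
function sorted_unique!(v::Vector)
    isempty(v) && return v
    w = 1  # index of the last kept element
    @inbounds for i in 2:length(v)
        if v[i] != v[w]
            w += 1
            v[w] = v[i]
        end
    end
    return resize!(v, w)
end
```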

PS. You would get a regular mergesort-merge without mutual-duplicate removal if you replaced `rs = litem >= ritem` by `rs = litem > ritem`, i.e. by `rs = !ls`.

The method you describe is basically a simplified Bloom filter.

Bloom filters are a cute data-structure and are described in many places. Since union of Bloom filters is simply ORing them, and that would take O(size of filter), you have a tradeoff between accuracy and speed.

IIRC there are some Bloom filter implementations in Julia already.
UPDATE: Probably.jl seems to include an implementation.
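To make the “union is just ORing” point concrete, here is a minimal toy Bloom filter (a sketch for illustration only — not Probably.jl’s API; the `Bloom` type and double-hashing scheme are made up). The key property for this thread: a Bloom filter has no false negatives, so the OR of two filters never loses an element of either input, matching the `union_overestimate` requirement above.

```julia
# Toy Bloom filter: m bits, k probe positions per element derived
# by double hashing from two seeded hashes.
struct Bloom
    bits::BitVector
    k::Int  # number of probe positions per element
end

Bloom(m::Int, k::Int) = Bloom(falses(m), k)

# k probe indices for x (double hashing: h1 + i*h2 mod m)
probes(f::Bloom, x) =
    (mod(hash(x) + i * hash(x, UInt(0x9e3779b9)), length(f.bits)) + 1 for i in 0:f.k-1)

function Base.push!(f::Bloom, x)
    for i in probes(f, x)
        f.bits[i] = true
    end
    return f
end

# May report false positives, never false negatives.
Base.in(x, f::Bloom) = all(f.bits[i] for i in probes(f, x))

# Union of two filters of equal size/k is a bitwise OR of their bit arrays.
Base.union(a::Bloom, b::Bloom) = Bloom(a.bits .| b.bits, a.k)
```

The union costs O(m/64) word-ORs regardless of how many elements the filters hold, which is where the accuracy/speed trade-off comes from: a smaller `m` makes unions faster but raises the false-positive rate.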