Best data structure for fast unions of large sets of integers

Yes, I said that in an earlier comment


@foobar_lv2 yeah that’s what I had in mind, here’s my version which leverages already_sorted:

struct SortedVector{T}
    data::Vector{T}
    SortedVector(data::Vector{T}; already_sorted=false) where {T} =
        new{T}(already_sorted ? data : sort(data))
end

function Base.union(v1::SortedVector{T}, v2::SortedVector{T}) where {T}
    x, y = v1.data, v2.data
    z, i, j = T[], 1, 1
    while i <= length(x) && j <= length(y)  # merge, keeping shared elements once
        xi, yj = x[i], y[j]
        push!(z, min(xi, yj))
        i += xi <= yj; j += yj <= xi
    end
    append!(z, @view(x[i:end])); append!(z, @view(y[j:end]))
    return SortedVector(z; already_sorted=true)
end

Sorry, reading these fast threads carefully is sometimes challenging. I’ll like that post!

If my implementation is correct it’s also super fast
EDIT: dear reader, it was not correct


Another idea: Just do it lazily.
If queries happen only after many unions, perhaps just keep the collection of Sets to be unioned in a vector, and only perform the union when a membership query is started.

This could allow doing the union in binary tree-like reduction fashion over the sets (or even in parallel).
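The idea above can be sketched roughly as follows (LazyUnion, materialize, and the mergeunion stand-in are illustrative names, not an existing package; mergeunion here is a naive placeholder for the real merge):

```julia
# Hypothetical sketch: accumulate sets, union only when a query forces it.
mergeunion(x, y) = sort!(unique!(vcat(x, y)))  # stand-in for the real merge

struct LazyUnion{T}
    parts::Vector{Vector{T}}
end

function materialize(u::LazyUnion{T}) where {T}
    parts = copy(u.parts)
    isempty(parts) && return T[]
    while length(parts) > 1  # binary tree-like reduction over the sets
        parts = [i + 1 <= length(parts) ? mergeunion(parts[i], parts[i + 1]) : parts[i]
                 for i in 1:2:length(parts)]
    end
    return parts[1]
end

materialize(LazyUnion([[1, 5, 9], [2, 5], [9, 10]]))  # == [1, 2, 5, 9, 10]
```

The pairwise reduction also means the merged operands stay similar in size, and the independent pairs could be handed to threads.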

Generally, the branchfree mergesort variant is quick to write down and somewhat okayish in performance.

But it is really latency-bound on all the CMOVs, so it has terrible CPU utilization. You should definitely enable hyperthreading / SMT and use a thread per logical core, not per physical core.
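For reference, a sketch of what the branch-free merge step looks like (simplified: a fully branch-free version would use sentinels, but the ternary select below is the kind of pattern the compiler can lower to CMOV; this is the plain mergesort merge, which keeps duplicates):

```julia
# Sketch of a branch-free merge: cursors advance via Bool arithmetic,
# and the select is what typically compiles to CMOV.
function branchfree_merge(x::Vector{T}, y::Vector{T}) where {T}
    z = Vector{T}(undef, length(x) + length(y))
    i = j = 1
    @inbounds for k in eachindex(z)
        take_x = j > length(y) || (i <= length(x) && x[i] <= y[j])
        z[k] = take_x ? x[i] : y[j]
        i += take_x
        j += !take_x
    end
    return z
end

branchfree_merge([1, 3, 5], [2, 3, 6])  # == [1, 2, 3, 3, 5, 6]
```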

Ideally, one would SIMD this. You can take a look at e.g. for two approaches:

  1. The obvious but annoying approach, that is also taken in the linked article: Use SIMD like multi-threading, a la ispc (needs avx512).
  2. Fancy sorting networks, you can take a look at the intro and references in the article (should generally work on most architectures).

Either way is probably more pain than you want. I can confidently say that it is more pain than I want, but if someone made a package I’d be happy to use & adapt :wink:


Can’t I just borrow this step (which is indeed the “merge” in merge sort) from somewhere, like Base?

Copy-paste my code? Feel free to take it.

This is not exactly the merge from mergesort, since the litem == ritem case is handled differently (mergesort merge needs to insert the item twice, we need to insert it only once).

Maybe it's better to copy-paste-modify the Base code; then you can drop a link to where it was adapted from.
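Concretely, the one-line difference from the mergesort merge (merge_union! is an illustrative name): in the equal case, emit the element once and advance both cursors:

```julia
# Union-merge of two sorted, duplicate-free vectors into z.
function merge_union!(z, x, y)
    i = j = 1
    while i <= length(x) && j <= length(y)
        if x[i] < y[j]
            push!(z, x[i]); i += 1
        elseif x[i] > y[j]
            push!(z, y[j]); j += 1
        else                     # x[i] == y[j]: keep one copy, skip both
            push!(z, x[i]); i += 1; j += 1
        end
    end
    append!(z, @view(x[i:end])); append!(z, @view(y[j:end]))
    return z
end

merge_union!(Int[], [1, 3, 5], [2, 3, 6])  # == [1, 2, 3, 5, 6]
```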


Thanks for the help!

Ok, I took another look. It’s not as bad as I thought. Example code is the union_vector16 function (for AVX2) in CRoaring/src/array_util.c (RoaringBitmap/CRoaring on GitHub). The code is super cool!

So if you care a lot about performance on this, maybe ccall if possible? Adapting to Int32 vectors might be a pain.

This is the kind of code that is better written in C than in Julia: not really C-as-a-language, rather C-as-macro-assembler, with lots of x86 SIMD intrinsics that have no target-independent LLVM spelling, like vector-shuffle used as a lookup table.


So here are the updated benchmarks with the hopefully correct adaptation of @foobar_lv2’s code, which I called SortedVector{T,V}:

using BenchmarkTools, Printf

function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
    T = eltype(S)
    universe = T(1):T(universe_size)
    return @belapsed union(A, B) setup = ( #
        A = $S(rand($universe, $set_size));
        B = $S(rand($universe, $set_size))
    ) seconds = 1
end

for p in 1:5, S in (BitSet, Set{UInt}, SortedVector{UInt,Vector{UInt}})
    set_size, universe_size = 10^p, 10^6
    time = round(benchmark_settype(S; universe_size, set_size); sigdigits=3)
    @info "$S $(" "^(40 - length(string(S)))) - 10^$p / 10^6 - $(@sprintf("%.1e", time)) s"
end
[ Info: BitSet                                    - 10^1 / 10^6 - 8.8e-06 s
[ Info: Set{UInt64}                               - 10^1 / 10^6 - 4.2e-07 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^1 / 10^6 - 6.0e-08 s
[ Info: BitSet                                    - 10^2 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^2 / 10^6 - 4.0e-06 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^2 / 10^6 - 4.8e-07 s
[ Info: BitSet                                    - 10^3 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^3 / 10^6 - 5.4e-05 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^3 / 10^6 - 4.6e-06 s
[ Info: BitSet                                    - 10^4 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^4 / 10^6 - 5.3e-04 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^4 / 10^6 - 4.5e-05 s
[ Info: BitSet                                    - 10^5 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^5 / 10^6 - 5.5e-03 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^5 / 10^6 - 4.4e-04 s

Chart view of these results:
[bar chart of union times for BitSet, Set{UInt64}, and SortedVector{UInt64, Vector{UInt64}}]


How did you plot this? SortedVector is uniformly faster, right?

Took the data from your post and plugged it into a pivot table in Google Sheets.
Yes, SortedVector is uniformly faster.

If one of you (plural) has a ChatGPT / Opus subscription, then having an AI generate this code would be a good benchmark. I’m curious about the success rate and the prompting method.


Oh right, it’s uniformly faster than Set but not than BitSet, I hadn’t realized

Thanks for the plotting!


Okay, I may not be very good at software engineering, but algorithms are something I’m reasonably good at. So, let me take a look.
Maybe I’m late here, but what do you need it to do? Do you need to collect the union as an array? Do you ever need to test whether an element is in A or B? Do you ever need to collect all the elements as a vector, or iterate through them? What operations do you need the set to have? A data structure can be designed once the required operations are given. You said you need fast union. What else do you need? Do you need the original sets? Does the sparsity have a pattern (e.g. the set is sparse but elements are grouped together, like [1, 2, 3, 1e6, 1e6+2, 1e6+3])?

A typical data structure lists all the supported operations. For example, an AVL tree has insertion, deletion, traversal, and searching for an element that matches or is just above/below a given one. However, for integers of fixed size (e.g. 32-bit integers), a van Emde Boas tree might be faster.

As stated in the original post, I need to

I have no workable assumptions on the structure of these sets. I only know the maximum possible value universe_size, and that set_size will be small compared to that (but I don’t know how small, whether it’s 1/10 or 1/1000).

I never need to test x in s. I don’t care about collecting to an array, as long as I can iterate. The only thing I care about is fast unions.

You need to iterate through a union? Do you need the union of multiple sets, or just a pair? Can you tolerate duplicates?
TBH… I was thinking of just keeping a collection of sets and iterating through them.
If there is only a pair of sets, iterating is easy: iterate through the first set, then through the second set while checking whether each element is already in the first…
If you need to iterate through a union of multiple sets, or a union of unions, then keep a list or a set of the sets that make up the union. When you need to iterate, just iterate through each of them in turn. If you need to filter out duplicates, maybe use a bitset or hashset to do so. If you can accept some duplicates, a simplified hashset that deletes the collided element whenever a collision happens might work. How much auxiliary space do you have? Yeah… there are many options.
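The pairwise case can be sketched in one line (union_iter is an illustrative name):

```julia
# Iterate a union of two sets without materializing it: the whole first
# set, then the second minus anything already present in the first.
union_iter(a, b) = Iterators.flatten((a, Iterators.filter(x -> !(x in a), b)))

sort(collect(union_iter(Set([1, 2]), Set([2, 3]))))  # == [1, 2, 3]
```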

Might be worth trying C = vcat(A, B) then


What is the distribution of elements on the big universe? If it isn’t uniform, then perhaps splitting into ‘heavy hitters’ and rare elements is useful.

On the whole, it appears a Bloom filter would be a good data structure, combined with a vector of the elements (filters don’t allow easy iteration).
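A minimal Bloom-filter sketch to show why unions are cheap here (all names and parameters are illustrative; real usage would size m and k from the expected set size and target false-positive rate):

```julia
# Toy Bloom filter: membership is approximate (false positives possible,
# never false negatives), but the union of two filters is just bitwise OR.
struct Bloom
    bits::BitVector
    nhashes::Int
end
Bloom(m::Integer, k::Integer) = Bloom(falses(m), k)

positions(bf::Bloom, x) =
    (1 + mod(hash(x, UInt(i)), length(bf.bits)) for i in 1:bf.nhashes)

add!(bf::Bloom, x) = (foreach(p -> bf.bits[p] = true, positions(bf, x)); bf)
Base.in(x, bf::Bloom) = all(p -> bf.bits[p], positions(bf, x))
Base.union(a::Bloom, b::Bloom) = Bloom(a.bits .| b.bits, a.nhashes)

a, b = Bloom(1024, 3), Bloom(1024, 3)
add!(a, 17); add!(b, 42)
u = union(a, b)
17 in u && 42 in u  # true
```

The union is a single O(m/64) bitwise OR over the two bit vectors, independent of how many elements each filter holds.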
