Best data structure for fast unions of large sets of integers

oxinabox · May 3, 2024, 12:36pm

Yes, I said that in an earlier comment

gdalle · May 3, 2024, 12:38pm

@foobar_lv2 yeah that’s what I had in mind, here’s my version which leverages already_sorted:

struct SortedVector{T}
    data::Vector{T}
end

function SortedVector(data::Vector{T}; already_sorted=false) where {T}
    if already_sorted
        SortedVector{T}(data)
    else
        SortedVector{T}(sort(data))
    end
end

function Base.union(v1::SortedVector{T}, v2::SortedVector{T}) where {T}
    x, y = v1.data, v2.data
    z =  # do the merge sort removing duplicates
    return SortedVector(z; already_sorted=true)
end

Dan · May 3, 2024, 12:39pm

Sorry, reading these fast threads carefully is sometimes challenging. I’ll like that post!

gdalle · May 3, 2024, 12:40pm

If my implementation is correct it’s also super fast
EDIT: dear reader, it was not correct

Dan · May 3, 2024, 12:43pm

Another idea: Just do it lazily.
If queries are done after many unions. Perhaps just keeping the collection of Sets to unionize in a vector, and later unionizing them (when a membership query is started).

This could allow doing the union in binary tree-like reduction fashion over the sets (or even in parallel).

foobar_lv2 · May 3, 2024, 12:44pm

Generally, the branchfree mergesort variant is quick to write down and somewhat okayish in performance.

But it is really latency-bound on all the CMOV, so has terrible cpu utilization. You should definitely enable hyperthreading / SMT / use a thread per logical core, not per physical core.

Ideally, one would SIMD this. You can take a look at e.g. https://repository.gatech.edu/bitstreams/e2d56fad-0062-467f-a0d9-a0d39191cbb6/download for two approaches:

The obvious but annoying approach, that is also taken in the linked article: Use SIMD like multi-threading, a la ispc (needs avx512).
Fancy sorting networks, you can take a look at the intro and references in the article (should generally work on most architectures).

Either way is probably more pain than you want. I can confidently say that it is more pain than I want, but if someone made a package I’d be happy to use & adapt

gdalle · May 3, 2024, 12:46pm

Can’t I just borrow this step (which is indeed the “merge” in merge sort) from somewhere, like Base?

foobar_lv2 · May 3, 2024, 12:52pm

Copy-paste my code? Feel free to take it.

This is not exactly the merge from mergesort, since the litem == ritem case is handled differently (mergesort merge needs to insert the item twice, we need to insert it only once).

Maybe better to copy-paste-modify base code, then you can drop a link to where it is adapted from.

gdalle · May 3, 2024, 12:58pm

Thanks for the help!

foobar_lv2 · May 3, 2024, 1:32pm

Ok, I took another look. It’s not as bad as I thought. Example code is CRoaring/src/array_util.c at master · RoaringBitmap/CRoaring · GitHub and the union_vector16 function for AVX2. The code is super cool!

So if you care a lot about performance on this, maybe ccall if possible? Adapting to Int32 vectors might be a pain.

This is the kind of code that is better written in C than julia: not really C-as-a-language, rather C-as-macro-assembler with lots of x86 simd intrinsics that have no target-independent llvm spelling, like vector-shuffle as lookup table.

gdalle · May 3, 2024, 2:44pm

So here are the updated benchmarks with the hopefully correct adaptation of @foobar_lv2’s code, which I called SortedVector{T,V}:

using BenchmarkTools, Printf

function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
    T = eltype(S)
    universe = T(1):T(universe_size)
    return @belapsed union(A, B) setup = ( #
        A = $S(rand($universe, $set_size));
        B = $S(rand($universe, $set_size))
    ) seconds = 1
end

for p in 1:5, S in (BitSet, Set{UInt}, SortedVector{UInt,Vector{UInt}})
    set_size, universe_size = 10^p, 10^6
    time = round(benchmark_settype(S; universe_size, set_size); sigdigits=3)
    @info "$S $(" "^(40 - length(string(S)))) - 10^$p / 10^6 - $(@sprintf("%.1e", time)) s"
end

[ Info: BitSet                                    - 10^1 / 10^6 - 8.8e-06 s
[ Info: Set{UInt64}                               - 10^1 / 10^6 - 4.2e-07 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^1 / 10^6 - 6.0e-08 s
[ Info: BitSet                                    - 10^2 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^2 / 10^6 - 4.0e-06 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^2 / 10^6 - 4.8e-07 s
[ Info: BitSet                                    - 10^3 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^3 / 10^6 - 5.4e-05 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^3 / 10^6 - 4.6e-06 s
[ Info: BitSet                                    - 10^4 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^4 / 10^6 - 5.3e-04 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^4 / 10^6 - 4.5e-05 s
[ Info: BitSet                                    - 10^5 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64}                               - 10^5 / 10^6 - 5.5e-03 s
[ Info: SortedVector{UInt64, Vector{UInt64}}      - 10^5 / 10^6 - 4.4e-04 s

Dan · May 3, 2024, 3:20pm

Chart view of these results:
BitSet, Set{UInt64} and SortedVector{UInt64 Vector{UInt64}}

gdalle · May 3, 2024, 3:23pm

How did you plot this? SortedVector is uniformly faster, right?

Dan · May 3, 2024, 3:24pm

Took the data from your post, and plugged it into a pivot table on Google sheets.
Yes SortedVector uniformly faster.

If one of you (plural) is a ChatGPT / Opus sub, then having AI generate this task is a good benchmark. Am curious as to success and prompt method.

gdalle · May 3, 2024, 3:27pm

Oh right, it’s uniformly faster than Set but not than BitSet, I hadn’t realized

Thanks for the plotting!

Tarny_GG_Channie · May 4, 2024, 11:02am

Okay, I may not be very good at software engineering but algorithms are something I’m reasonably good at. So, let me take a look.
Maybe I’m late here, but what do you need it to do? Do you need to collect the union as array? Do you need to test if an element is in A or B ever? Do you ever need to collect all the elements as vector or iterate through it? What operations do you need the set to have? A data structure can be designed when the required operations are given. You said you need fast union. What else do you need? Do you need the original sets? Does the sparsity have a pattern(eg: the set is sparse but elements are grouped together like [1,2,3,1e6,1e6+2,1e6+3])?

A typical data structure lists all the supported operations. For example, an AVL tree has insertion, deletion, traversal, and searching for an element that matches or is just above/below it. However, for integers with fixed size (eg 32 bits integer), a Van Emde Boas tree might be faster.

gdalle · May 4, 2024, 11:16am

As stated in the original post, I need to

I have no workable assumptions on the structure of these sets. I only know the maximum possible value universe_size, and that the set_size will be small compared to that (but I don’t know small, whether it’s 1/10 or 1/1000).

I never need to test x in s. I don’t care about collecting to an array, as long as I can iterate. The only thing I care about are fast unions.

Tarny_GG_Channie · May 4, 2024, 11:54am

You need to iterate through a union? Do you need the union of multiple sets, or just a pair? Can you tolerate duplicate?
TBH… I was thinking of just keeping a collection of sets and iterating through it.
If there is only a pair of sets, iterating through it is easy. Just iterate through the first set, then the second set, while checking whether it is in the first set…
If you need to iterate through a union of multiple sets, or a union of union, then you keep a list or a set of the sets you have in a union. Then, when you need to iterate, just iterate twice. If you need to filter out duplicates, maybe use bitset or hashset to do so. If you can accept some duplicates, maybe a simplified hashset where you delete the collided element whenever a collision happens might work. How much auxiliary space do you have? Yeah… There are many options.

sumiya11 · May 4, 2024, 11:55am

Might be worth to try C = vcat(A, B) then

Dan · May 4, 2024, 12:53pm

What is the distribution of elements on the big universe? If it isn’t uniform, then perhaps splitting into ‘heavy hitters’ and rare elements is useful.

On the whole, it appears a Bloom filter will be a good data structure, combined with a vector of elements (filters don’t allow easy iteration).

Topic		Replies	Views
Can I speed up this recursive set implementation? Performance performance , set , autodiff , nerdsnipe	38	688	May 10, 2024
Sparse set-like indexible data structure General Usage question	9	878	October 19, 2017
Complexity/implementation of disjoint-set (union-find) in `DataStructures.jl` Performance question , documentation , datastructures	4	260	September 11, 2024
Slow code with Union and Box General Usage	17	293	March 6, 2024
True power set of a set is slower than power array of array Performance	3	715	April 17, 2021

Best data structure for fast unions of large sets of integers

Related topics