Yes, I said that in an earlier comment
@foobar_lv2 yeah thatās what I had in mind, hereās my version which leverages already_sorted
:
struct SortedVector{T}
data::Vector{T}
end
function SortedVector(data::Vector{T}; already_sorted=false) where {T}
if already_sorted
SortedVector{T}(data)
else
SortedVector{T}(sort(data))
end
end
function Base.union(v1::SortedVector{T}, v2::SortedVector{T}) where {T}
x, y = v1.data, v2.data
z = # do the merge sort removing duplicates
return SortedVector(z; already_sorted=true)
end
Sorry, reading these fast threads carefully is sometimes challenging. Iāll like that post!
If my implementation is correct itās also super fast
EDIT: dear reader, it was not correct
Another idea: Just do it lazily.
If queries are done after many unions. Perhaps just keeping the collection of Sets to unionize in a vector, and later unionizing them (when a membership query is started).
This could allow doing the union
in binary tree-like reduction fashion over the sets (or even in parallel).
Generally, the branchfree mergesort variant is quick to write down and somewhat okayish in performance.
But it is really latency-bound on all the CMOV, so has terrible cpu utilization. You should definitely enable hyperthreading / SMT / use a thread per logical core, not per physical core.
Ideally, one would SIMD this. You can take a look at e.g. https://repository.gatech.edu/bitstreams/e2d56fad-0062-467f-a0d9-a0d39191cbb6/download for two approaches:
- The obvious but annoying approach, that is also taken in the linked article: Use SIMD like multi-threading, a la ispc (needs avx512).
- Fancy sorting networks, you can take a look at the intro and references in the article (should generally work on most architectures).
Either way is probably more pain than you want. I can confidently say that it is more pain than I want, but if someone made a package Iād be happy to use & adapt
Canāt I just borrow this step (which is indeed the āmergeā in merge sort) from somewhere, like Base?
Copy-paste my code? Feel free to take it.
This is not exactly the merge from mergesort, since the litem == ritem
case is handled differently (mergesort merge needs to insert the item twice, we need to insert it only once).
Maybe better to copy-paste-modify base code, then you can drop a link to where it is adapted from.
Thanks for the help!
Ok, I took another look. Itās not as bad as I thought. Example code is CRoaring/src/array_util.c at master Ā· RoaringBitmap/CRoaring Ā· GitHub and the union_vector16
function for AVX2. The code is super cool!
So if you care a lot about performance on this, maybe ccall
if possible? Adapting to Int32 vectors might be a pain.
This is the kind of code that is better written in C than julia: not really C-as-a-language, rather C-as-macro-assembler with lots of x86 simd intrinsics that have no target-independent llvm spelling, like vector-shuffle as lookup table.
So here are the updated benchmarks with the hopefully correct adaptation of @foobar_lv2ās code, which I called SortedVector{T,V}
:
using BenchmarkTools, Printf
function benchmark_settype(::Type{S}; universe_size, set_size) where {S}
T = eltype(S)
universe = T(1):T(universe_size)
return @belapsed union(A, B) setup = ( #
A = $S(rand($universe, $set_size));
B = $S(rand($universe, $set_size))
) seconds = 1
end
for p in 1:5, S in (BitSet, Set{UInt}, SortedVector{UInt,Vector{UInt}})
set_size, universe_size = 10^p, 10^6
time = round(benchmark_settype(S; universe_size, set_size); sigdigits=3)
@info "$S $(" "^(40 - length(string(S)))) - 10^$p / 10^6 - $(@sprintf("%.1e", time)) s"
end
[ Info: BitSet - 10^1 / 10^6 - 8.8e-06 s
[ Info: Set{UInt64} - 10^1 / 10^6 - 4.2e-07 s
[ Info: SortedVector{UInt64, Vector{UInt64}} - 10^1 / 10^6 - 6.0e-08 s
[ Info: BitSet - 10^2 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64} - 10^2 / 10^6 - 4.0e-06 s
[ Info: SortedVector{UInt64, Vector{UInt64}} - 10^2 / 10^6 - 4.8e-07 s
[ Info: BitSet - 10^3 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64} - 10^3 / 10^6 - 5.4e-05 s
[ Info: SortedVector{UInt64, Vector{UInt64}} - 10^3 / 10^6 - 4.6e-06 s
[ Info: BitSet - 10^4 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64} - 10^4 / 10^6 - 5.3e-04 s
[ Info: SortedVector{UInt64, Vector{UInt64}} - 10^4 / 10^6 - 4.5e-05 s
[ Info: BitSet - 10^5 / 10^6 - 1.1e-05 s
[ Info: Set{UInt64} - 10^5 / 10^6 - 5.5e-03 s
[ Info: SortedVector{UInt64, Vector{UInt64}} - 10^5 / 10^6 - 4.4e-04 s
Chart view of these results:
How did you plot this? SortedVector
is uniformly faster, right?
Took the data from your post, and plugged it into a pivot table on Google sheets.
Yes SortedVector uniformly faster.
If one of you (plural) is a ChatGPT / Opus sub, then having AI generate this task is a good benchmark. Am curious as to success and prompt method.
Oh right, itās uniformly faster than Set
but not than BitSet
, I hadnāt realized
Thanks for the plotting!
Okay, I may not be very good at software engineering but algorithms are something Iām reasonably good at. So, let me take a look.
Maybe Iām late here, but what do you need it to do? Do you need to collect the union as array? Do you need to test if an element is in A or B ever? Do you ever need to collect all the elements as vector or iterate through it? What operations do you need the set to have? A data structure can be designed when the required operations are given. You said you need fast union. What else do you need? Do you need the original sets? Does the sparsity have a pattern(eg: the set is sparse but elements are grouped together like [1,2,3,1e6,1e6+2,1e6+3])?
A typical data structure lists all the supported operations. For example, an AVL tree has insertion, deletion, traversal, and searching for an element that matches or is just above/below it. However, for integers with fixed size (eg 32 bits integer), a Van Emde Boas tree might be faster.
As stated in the original post, I need to
I have no workable assumptions on the structure of these sets. I only know the maximum possible value universe_size
, and that the set_size
will be small compared to that (but I donāt know small, whether itās 1/10 or 1/1000).
I never need to test x in s
. I donāt care about collecting to an array, as long as I can iterate. The only thing I care about are fast unions.
You need to iterate through a union? Do you need the union of multiple sets, or just a pair? Can you tolerate duplicate?
TBHā¦ I was thinking of just keeping a collection of sets and iterating through it.
If there is only a pair of sets, iterating through it is easy. Just iterate through the first set, then the second set, while checking whether it is in the first setā¦
If you need to iterate through a union of multiple sets, or a union of union, then you keep a list or a set of the sets you have in a union. Then, when you need to iterate, just iterate twice. If you need to filter out duplicates, maybe use bitset or hashset to do so. If you can accept some duplicates, maybe a simplified hashset where you delete the collided element whenever a collision happens might work. How much auxiliary space do you have? Yeahā¦ There are many options.
Might be worth to try C = vcat(A, B)
then
What is the distribution of elements on the big universe? If it isnāt uniform, then perhaps splitting into āheavy hittersā and rare elements is useful.
On the whole, it appears a Bloom filter will be a good data structure, combined with a vector of elements (filters donāt allow easy iteration).