Best data structure for fast unions of large sets of integers

gdalle · May 4, 2024, 1:29pm

I need to iterate through the objects that represent my sets, but only at the very end of the procedure, so it’s not the time-consuming part.

Unions will always be pairwise. Basically it’s about an operator overloading system where numbers are replaced by sets of integers, and all the usual operations are replaced by set unions. Don’t ask why it’s useful, trust @hill on this

Yes but duplicates might slow down other unions further down the line, by making the sets bigger.

That’s actually a really nice and simple idea. If there are never any duplicates this is optimal, but I cannot know it ahead of time

Tarny_GG_Channie · May 4, 2024, 1:58pm

I don’t see why not just keep a list (array,vector, etc) of sets that form the union. When you need to iterate through it, just iterate through each one and deduplicate with some mean.

Maybe even something like…

struct UnionSet{T}
A::Set{T}
B::Set{T}
end

Union is really fast, just copying two pointers.

Dan · May 4, 2024, 2:09pm

Considering the problem, switching from union to union! is a big deal. Essentially, just one BitSet is needed.

As bears out when actually running benchmark with union!.

gdalle · May 4, 2024, 2:28pm

I thought about that but I feared it might make the memory requirements of the program scale linearly with its depth. Imagine that we’re doing this operator overloading with a neural network:

If we create a new set at every operation, we can forget everything that happened in previous layers
If we do it with pointers, we have to keep track of every array of pointers since the beginning

That is an interesting insight!
However, conceptually these set objects are used like numbers in the program due to operator overloading. And when I do s1 + s2, i can’t afford to modify either argument in case I need them somewhere else for s1 + s3. Basically, since numbers are usually not mutable, there seems to be no way to leverage the mutability of sets

sumiya11 · May 4, 2024, 3:15pm

If you are evaluating arithmetic expressions over sets then reordering the union operations would probably be beneficial

Tarny_GG_Channie · May 4, 2024, 4:21pm

I understand your concern but I’m actually thinking this is a very memory efficient way. Let’s say A is 1000 elements and B is 1000 elements, same with C and so on, if you create A|B, then you might get a new 2000 elements set. Now, if you create A|C, you might get another 2000 elements set. Now, even if you forgot A, B, and C, you still end up consuming a lot of extra memories. Compare this to just keeping a few pointers.

Janis_Erdmanis · May 4, 2024, 5:55pm

I’m joining this discussion a bit late, but I think there might be a viable approach involving the idea of a bitset combined with run-length encoding in a bitvector which woulde address issues related to sparsity. To construct a union, one would need to iterate through both bitvectors to create a third one. However, a significant drawback is that this process requires handling elements individually rather than using bitwise operations which processes result in chunks of 64 bits, as is possible with a bitset.

One possible optimization to speed up the process involves handling large gaps in one set: bits from the other set can be ORed into the resulting set within these gaps. The main challenge here is determining where to stop copying, for which I can’t think of a neat solution.

gdalle · May 5, 2024, 5:05am

That’s fair. I guess it will depend on the specific application, and the depth of the computational graph. I’ll implement it to see

gdalle · May 5, 2024, 5:05am

I’m not sure I understand what you mean

Janis_Erdmanis · May 5, 2024, 9:43am

It is similar to listing out indices when one occurs. The run-length encoding lists the 0s between every one and encodes them in the bit vector. This is generally used as a simple compression strategy when the only known thing about the bivector is that it is sparse.

A very crude implementation is here. I see a benefit if there are long gaps in one of the sets. If another set has many elements, the union would require copying the bits into the resulting bit vector from it directly, which can be done in chunks and thus could be faster than pushing individual integers into a vector. To have such optimisation, though, one would need to encode information on how many bits a given chunk represents.

rocco_sprmnt21 · May 5, 2024, 2:25pm

I too had thought of an approximate solution like this.
I propose a less brutal, albeit slower, one.

[A;deleteat!(B,findall(!isnothing,indexin(B,A)))]

If I have not misunderstood, the operations to be performed are simply the union of a set of values.
Something along the following lines could be adopted


A = Set(rand(1:universe_size, set_size))
B = Set(rand(1:universe_size, set_size))
C = rand(1:universe_size, set_size)

function byteset_un(v1::Array, v2::BitArray=falses(10^6))
    for b in v1
        v2[b]=true
    end
    v2
end

byteset_un(v1::Array, v2::Array) = byteset_un(v2,byteset_un(v1))

byteset_un(v1::BitArray, v2::BitArray) = v1 .| v2


byteset_un(A,B)

bA=byteset_un(A)

bB=byteset_un(B,bA)
bC=byteset_un(C,bB)

bAB=byteset_un(bA,bA)

gdalle · May 6, 2024, 4:57pm

See this new topic for a discussion of the recursive set idea, if the performance nerds are not tired yet!

GunnarFarneback · May 6, 2024, 6:11pm

I’m also very late to this party. Is the union-find data structure useful for this application?

gdalle · May 6, 2024, 6:13pm

I don’t think so because I am not sure that the sets I union will be disjoint (in fact, I’m pretty sure they won’t be in most cases)

GunnarFarneback · May 6, 2024, 6:14pm

I don’t think that’s a requirement, just an internal property of the data structure.

Mason · May 6, 2024, 6:20pm

I’d assume it’s faster to make them disjoint than it would be to merge them for most use-cases though right? e.g.

julia> let sets = (Set([1,2]), Set([2, 3, 4]), Set([4, 5, 6]))
           for i ∈ 1:(length(sets)-1)
               setdiff!(sets[i], sets[i+1:end]...)
           end
           sets
       end
(Set([1]), Set([2, 3]), Set([5, 4, 6]))

gdalle · May 6, 2024, 6:23pm

Maybe but you have to imagine that each set stands for a number, and that operations between two numbers like x_1 + x_2 or x_1 x_2 are replaced by s_1 \cup s_2. Now imagine that you plug those set-numbers inside a neural network. What I want to know is the sets that you get at the very end, which track the dependencies of the output numbers on the input numbers. Making the sets disjoint at each layer, for each dimension, would be a waste of time