Sparse set-like indexible data structure

I have a need for a data type in my code which is set-like, so stores only unique values, and can efficiently do intersections and unions and such, but that can also be indexed, and is efficient for sparse sets.

My possible values are 1 - 64, but I will only ever have a few - rarely more than 4 of those possible values, in any one set. Is there a data structure in julia base or a package which is well suited to those requirements?

Can you be more specific about indexing? Would any stable mapping from integers to elements do, or do you want sorting, indexing by insertion order, etc? A MWE would clarify things. That said, the first place I would look is

Specifically I want to index by integer because I want to work with all of the pairs of values that are possible in the set.

In an i = 1:length(myset), j = (i+1):endof(set) sort of way.

If I want to so that with Set in Base, I have to collect into an array and then work with the array. But I don’t want the extra array allocation, because I’m going to be building these small sets and doing this operation with them a huge number of times.

Perhaps you have already considered this, but you can iterate on a Set. It does not construct an intermediate vector, just traverses the elements (same mechanism that traverses a Dict, since Set is implemented as a special case of that).

If you know your values to be integers from 1 to 64, you could use a sparse Boolean vector of length 64, which holds true value only for the indices mapping the elements in the set. The union is then just element-wise || and the intersect is element-wise &&, and all the values in the set will be stored in the field nzind sorted! I guess the least efficient bit is adding or removing elements from the middle of the vector but for small vectors, it may not be a problem.

This sounds a lot like a slightly more specialized IntSet. You can avoid much of the overhead since all your values fit into a single UInt64 bit mask, and even though indexing would have to be O(N), N is really small. Range subsetting is even easier — it’s just bitwise and with an appropriately constructed mask.

https://github.com/JuliaLang/julia/blob/master/base/intset.jl

1 Like

I’ve been thinking it sounds like a very small bit-vector.

The issue for me whilst a 64bit number will store which ints are in the set.
When I want to get the nth value present in the set I have to iterate through each bit until I get to the nth 1 bit. Which in a sparse set is not great, it’s 64 iterations in the worst case for this situation.
Which isn’t great especially given I’m going to have a lot of these small sets. I was wondering if the addition of some sort of index would make that operation faster.

I think you’re overthinking it. These operations are fast. The difference between getting the first bit of an IntSet and the 64th bit of an IntSet is about a nanosecond. Doing anything more complicated adds more branches, more data, more code, and almost certainly more time.

julia> S = IntSet()
       push!(S, 64)
IntSet([64])

julia> @btime first($S) # effectively worst case S[1]
  29.776 ns (0 allocations: 0 bytes)
64

julia> push!(S, 1)
IntSet([1, 64])

julia> @btime first($S) # best case S[1]
  28.748 ns (0 allocations: 0 bytes)
1

And this could be even faster were it simply using one UInt64 instead of a BitVector.

3 Likes

Ok thanks! I shall make a small type that does this for 64 values. Just weighing up whether to make it a mutable struct allocated and managed by the gc, or an immutable primitive type, given the number of them I’ll be creating and manipulating. Although I may be able to refactor to reuse the same instance of the mutable again and again.

You could also cache the Set values into a pre-allocated Array for faster indexed lookup. That way you only need the slower index into Set once.

Here’s an example to show what I mean:

Method 1: No caching - direct indexing into Set:

function Base.getindex(S::IntSet, i::Integer)
    if 1 <= i <= length(S)
        state = start(S)
        x = 0
        for j = 1:i
            x, state = next(S, state)
        end
        return x
    else
        throw(BoundsError(S, i))
    end
end

function pairstuff1(S::AbstractSet, isprint=false)
    for i = 1:length(S), j = (i+1):length(S)
        x1 = S[i]
        x2 = S[j]
    end
end

Method 2: Caching - index into Set once only, then index into pre-allocated Array

const A = Vector{Int}(64);

function pairstuff2(S::AbstractSet, isprint=false)
    # populate array
    i = 0
    for x in S
        i += 1
        A[i] = x
    end

    # pairwise stuff
    for i = 1:length(S), j = (i+1):length(S)
        x1 = A[i]
        x2 = A[j]
    end
end

Now some benchmarking …

using BenchmarkTools
S = IntSet([5, 10, 15, 55])
@benchmark pairstuff1($S)
@benchmark pairstuff2($S)

On my machine, cached version is about 5 times faster.

1 Like