Sparse set-like indexible data structure

question

#1

I have a need for a data type in my code which is set-like, so stores only unique values, and can efficiently do intersections and unions and such, but that can also be indexed, and is efficient for sparse sets.

My possible values are 1 - 64, but I will only ever have a few - rarely more than 4 of those possible values, in any one set. Is there a data structure in julia base or a package which is well suited to those requirements?


#2

Can you be more specific about indexing? Would any stable mapping from integers to elements do, or do you want sorting, indexing by insertion order, etc? A MWE would clarify things. That said, the first place I would look is


#3

Specifically I want to index by integer because I want to work with all of the pairs of values that are possible in the set.

In an i = 1:length(myset), j = (i+1):endof(set) sort of way.

If I want to so that with Set in Base, I have to collect into an array and then work with the array. But I don’t want the extra array allocation, because I’m going to be building these small sets and doing this operation with them a huge number of times.


#4

Perhaps you have already considered this, but you can iterate on a Set. It does not construct an intermediate vector, just traverses the elements (same mechanism that traverses a Dict, since Set is implemented as a special case of that).


#5

If you know your values to be integers from 1 to 64, you could use a sparse Boolean vector of length 64, which holds true value only for the indices mapping the elements in the set. The union is then just element-wise || and the intersect is element-wise &&, and all the values in the set will be stored in the field nzind sorted! I guess the least efficient bit is adding or removing elements from the middle of the vector but for small vectors, it may not be a problem.


#6

This sounds a lot like a slightly more specialized IntSet. You can avoid much of the overhead since all your values fit into a single UInt64 bit mask, and even though indexing would have to be O(N), N is really small. Range subsetting is even easier — it’s just bitwise and with an appropriately constructed mask.

https://github.com/JuliaLang/julia/blob/master/base/intset.jl


#7

I’ve been thinking it sounds like a very small bit-vector.

The issue for me whilst a 64bit number will store which ints are in the set.
When I want to get the nth value present in the set I have to iterate through each bit until I get to the nth 1 bit. Which in a sparse set is not great, it’s 64 iterations in the worst case for this situation.
Which isn’t great especially given I’m going to have a lot of these small sets. I was wondering if the addition of some sort of index would make that operation faster.


#8

I think you’re overthinking it. These operations are fast. The difference between getting the first bit of an IntSet and the 64th bit of an IntSet is about a nanosecond. Doing anything more complicated adds more branches, more data, more code, and almost certainly more time.

julia> S = IntSet()
       push!(S, 64)
IntSet([64])

julia> @btime first($S) # effectively worst case S[1]
  29.776 ns (0 allocations: 0 bytes)
64

julia> push!(S, 1)
IntSet([1, 64])

julia> @btime first($S) # best case S[1]
  28.748 ns (0 allocations: 0 bytes)
1

And this could be even faster were it simply using one UInt64 instead of a BitVector.


#9

Ok thanks! I shall make a small type that does this for 64 values. Just weighing up whether to make it a mutable struct allocated and managed by the gc, or an immutable primitive type, given the number of them I’ll be creating and manipulating. Although I may be able to refactor to reuse the same instance of the mutable again and again.


#10

You could also cache the Set values into a pre-allocated Array for faster indexed lookup. That way you only need the slower index into Set once.

Here’s an example to show what I mean:

Method 1: No caching - direct indexing into Set:

function Base.getindex(S::IntSet, i::Integer)
    if 1 <= i <= length(S)
        state = start(S)
        x = 0
        for j = 1:i
            x, state = next(S, state)
        end
        return x
    else
        throw(BoundsError(S, i))
    end
end

function pairstuff1(S::AbstractSet, isprint=false)
    for i = 1:length(S), j = (i+1):length(S)
        x1 = S[i]
        x2 = S[j]
    end
end

Method 2: Caching - index into Set once only, then index into pre-allocated Array

const A = Vector{Int}(64);

function pairstuff2(S::AbstractSet, isprint=false)
    # populate array
    i = 0
    for x in S
        i += 1
        A[i] = x
    end

    # pairwise stuff
    for i = 1:length(S), j = (i+1):length(S)
        x1 = A[i]
        x2 = A[j]
    end
end

Now some benchmarking …

using BenchmarkTools
S = IntSet([5, 10, 15, 55])
@benchmark pairstuff1($S)
@benchmark pairstuff2($S)

On my machine, cached version is about 5 times faster.