Sparse set-like indexible data structure

Ward9250 · October 18, 2017, 2:50pm

I have a need for a data type in my code which is set-like, so stores only unique values, and can efficiently do intersections and unions and such, but that can also be indexed, and is efficient for sparse sets.

My possible values are 1 - 64, but I will only ever have a few - rarely more than 4 of those possible values, in any one set. Is there a data structure in julia base or a package which is well suited to those requirements?

Tamas_Papp · October 18, 2017, 3:17pm

Can you be more specific about indexing? Would any stable mapping from integers to elements do, or do you want sorting, indexing by insertion order, etc? A MWE would clarify things. That said, the first place I would look is

Ward9250 · October 18, 2017, 3:28pm

Specifically I want to index by integer because I want to work with all of the pairs of values that are possible in the set.

In an i = 1:length(myset), j = (i+1):endof(set) sort of way.

If I want to so that with Set in Base, I have to collect into an array and then work with the array. But I don’t want the extra array allocation, because I’m going to be building these small sets and doing this operation with them a huge number of times.

Tamas_Papp · October 18, 2017, 3:43pm

Perhaps you have already considered this, but you can iterate on a Set. It does not construct an intermediate vector, just traverses the elements (same mechanism that traverses a Dict, since Set is implemented as a special case of that).

mohamed82008 · October 18, 2017, 4:36pm

If you know your values to be integers from 1 to 64, you could use a sparse Boolean vector of length 64, which holds true value only for the indices mapping the elements in the set. The union is then just element-wise || and the intersect is element-wise &&, and all the values in the set will be stored in the field nzind sorted! I guess the least efficient bit is adding or removing elements from the middle of the vector but for small vectors, it may not be a problem.

mbauman · October 18, 2017, 5:06pm

This sounds a lot like a slightly more specialized IntSet. You can avoid much of the overhead since all your values fit into a single UInt64 bit mask, and even though indexing would have to be O(N), N is really small. Range subsetting is even easier — it’s just bitwise and with an appropriately constructed mask.

https://github.com/JuliaLang/julia/blob/master/base/intset.jl

Ward9250 · October 18, 2017, 5:57pm

I’ve been thinking it sounds like a very small bit-vector.

The issue for me whilst a 64bit number will store which ints are in the set.
When I want to get the nth value present in the set I have to iterate through each bit until I get to the nth 1 bit. Which in a sparse set is not great, it’s 64 iterations in the worst case for this situation.
Which isn’t great especially given I’m going to have a lot of these small sets. I was wondering if the addition of some sort of index would make that operation faster.

mbauman · October 18, 2017, 6:09pm

I think you’re overthinking it. These operations are fast. The difference between getting the first bit of an IntSet and the 64th bit of an IntSet is about a nanosecond. Doing anything more complicated adds more branches, more data, more code, and almost certainly more time.

julia> S = IntSet()
       push!(S, 64)
IntSet([64])

julia> @btime first($S) # effectively worst case S[1]
  29.776 ns (0 allocations: 0 bytes)
64

julia> push!(S, 1)
IntSet([1, 64])

julia> @btime first($S) # best case S[1]
  28.748 ns (0 allocations: 0 bytes)
1

And this could be even faster were it simply using one UInt64 instead of a BitVector.

Ward9250 · October 18, 2017, 7:07pm

Ok thanks! I shall make a small type that does this for 64 values. Just weighing up whether to make it a mutable struct allocated and managed by the gc, or an immutable primitive type, given the number of them I’ll be creating and manipulating. Although I may be able to refactor to reuse the same instance of the mutable again and again.

greg_plowman · October 19, 2017, 1:21am

You could also cache the Set values into a pre-allocated Array for faster indexed lookup. That way you only need the slower index into Set once.

Here’s an example to show what I mean:

Method 1: No caching - direct indexing into Set:

function Base.getindex(S::IntSet, i::Integer)
    if 1 <= i <= length(S)
        state = start(S)
        x = 0
        for j = 1:i
            x, state = next(S, state)
        end
        return x
    else
        throw(BoundsError(S, i))
    end
end

function pairstuff1(S::AbstractSet, isprint=false)
    for i = 1:length(S), j = (i+1):length(S)
        x1 = S[i]
        x2 = S[j]
    end
end

Method 2: Caching - index into Set once only, then index into pre-allocated Array

const A = Vector{Int}(64);

function pairstuff2(S::AbstractSet, isprint=false)
    # populate array
    i = 0
    for x in S
        i += 1
        A[i] = x
    end

    # pairwise stuff
    for i = 1:length(S), j = (i+1):length(S)
        x1 = A[i]
        x2 = A[j]
    end
end

Now some benchmarking …

using BenchmarkTools
S = IntSet([5, 10, 15, 55])
@benchmark pairstuff1($S)
@benchmark pairstuff2($S)

On my machine, cached version is about 5 times faster.

Topic		Replies	Views
How best to maintain collection of indices into 2-d array New to Julia	3	496	October 16, 2019
Indexable set data structure General Usage indexing , data_structures	8	446	April 2, 2024
Best data structure for fast unions of large sets of integers Performance performance , set , datastructures	56	971	May 6, 2024
Findfirst alternative for Set General Usage question	6	389	May 23, 2023
Tiny associative array (alist-like) New to Julia data_structures	4	837	March 26, 2021

Sparse set-like indexible data structure

Related topics