Indexable set data structure

I’m looking for a data structure that I’m sure must have already been invented and implemented, but which I can’t find (probably because I don’t know its canonical name).

The data structure I have in mind shares some properties with both a Set/Dict and a Vector:

  • it only stores unique values (like a Set)
  • values can be efficiently accessed by a linear index (like a Vector)
  • given a value, it should be fast (by which I mean something like \mathcal{O}(1) complexity) to check whether it is in the set, and if so retrieve its index.

To illustrate what I mean, the following is a bare-bone (probably not optimal) implementation providing all features I need:

struct IndexedSet{T}
    idx :: Dict{T, Int}
    val :: Vector{T}
    IndexedSet{T}() where {T} = new{T}(Dict(), [])
end

function Base.push!(is::IndexedSet, val)
    get!(is.idx, val) do
        push!(is.val, val)
        lastindex(is.val)
    end
end

Base.iterate(is::IndexedSet, args...) = iterate(is.val, args...)
Base.in(val, is::IndexedSet) = val ∈ keys(is.idx)
Base.length(is::IndexedSet) = length(is.val)
Base.getindex(is::IndexedSet, idx) = is.val[idx]
indexof(is::IndexedSet, val) = is.idx[val]
# Tests / usage example
is = IndexedSet{String}()
push!(is, "foo")
push!(is, "bar")
push!(is, "baz")

push!(is, "foo") # no-op, since "foo" is already in the set
@assert length(is) == 3

# These should all be fast
@assert "bar" ∈ is
@assert is[3] == "baz"
@assert indexof(is, "baz") == 3

Isn’t this a classical data structure? Does it have a name?

I looked in DataStructures.jl and found OrderedSet or SortedSet, which come close but don’t provide everything I need (in particular no way to efficiently retrieve the index of a value as far as I could see).

1 Like
julia> using DataStructures, BenchmarkTools

julia> is = OrderedSet{String}()
OrderedSet{String}()

julia> push!(is, "foo")
OrderedSet{String} with 1 element:
  "foo"

julia> push!(is, "bar")
OrderedSet{String} with 2 elements:
  "foo"
  "bar"

julia> push!(is, "baz")
OrderedSet{String} with 3 elements:
  "foo"
  "bar"
  "baz"

julia> push!(is, "foo")
OrderedSet{String} with 3 elements:
  "foo"
  "bar"
  "baz"

julia> length(is) == 3
true

julia> @btime "bar" ∈ $is
  16.216 ns (0 allocations: 0 bytes)
true

julia> @btime $is[3] == "baz"
  12.513 ns (0 allocations: 0 bytes)
true

julia> @btime findfirst(==("baz"), $is) == 3
  26.104 ns (0 allocations: 0 bytes)
true

OrderedSet seems pretty fast to me.

1 Like

Thanks, but the search seems to have linear complexity: it’s only fast when the OrderedSet contains few elements.

julia> os = OrderedSet{String}()
OrderedSet{String}()

julia> for i in 1:1000
           push!(os, "foo_$i")
       end

julia> @btime findfirst(==("foo_500"), $os)
  1.565 μs (0 allocations: 0 bytes)
500

This is why I think “fast” in this context should mean “constant complexity”. For example, contrast the OrderedSet performance with an IndexedSet as implemented above:

julia> is = IndexedSet{String}()
IndexedSet{String}(Dict{String, Int64}(), String[])

julia> for i in 1:1000
           push!(is, "foo_$i")
       end

julia> @btime indexof($is, "foo_500")
  12.531 ns (0 allocations: 0 bytes)
500
2 Likes

ok, my previous response was not a good one, but I guess it was on a good track because maybe instead Indices{T} from Dictionaries.jl seems suitable:

julia> indset = Indices([1,2,3])
3-element Indices{Int64}
 1
 2
 3

julia> insert!(indset, 5)
4-element Indices{Int64}
 1
 2
 3
 5

julia> indset.values
4-element Vector{Int64}:
 1
 2
 3
 5

julia> 1 in indset
true

on your benchmark

julia> is = Indices{String}()
0-element Indices{String}

julia> for i in 1:1000
                  insert!(is, "foo_$i")
              end

julia> using BenchmarkTools

julia> @btime "foo_500" in $is
  12.566 ns (0 allocations: 0 bytes)
true

Edit: actually probably a Dictionary{T, Int} could be better to be able to create a indexof function


Edit2: actually for Indices there is gettoken

1 Like

This looks like a nice data structure!

I just noticed that your push! function does not return the original container you are pushing to, but one of its fields. Thus may lead to unexpected behavior.

1 Like

Maybe:

julia> using UniqueVectors

julia> uv = UniqueVector(("foo","bar","baz"))
3-element UniqueVector{String}:
 "foo"
 "bar"
 "baz"

julia> dump(uv)
UniqueVector{String}
  items: Array{String}((3,))
    1: String "foo"
    2: String "bar"
    3: String "baz"
  lookup: Dict{String, Int64}
    slots: Array{UInt8}((16,)) UInt8[0x00, 0x00, 0xb5, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xae, 0x00, 0x00, 0x00, 0x00, 0xe0]
    keys: Array{String}((16,))
      1: #undef
      2: #undef
      3: String "bar"

julia> for i in 1:1000
                  push!(uv, "foo_$i")
              end;

julia> @btime "foo_500" in $uv
  min 25.669 ns, mean 26.122 ns (0 allocations)
true

julia> @btime "foo_500" in $(collect(uv)::Vector)
  min 4.107 μs, mean 4.634 μs (0 allocations)
true

julia> @btime findfirst(isequal("foo_500"), $uv)
  min 27.679 ns, mean 28.534 ns (0 allocations)
503
2 Likes

A comparison between Indices and UniqueVector on my machine:

julia> using Dictionaries, UniqueVectors, BenchmarkTools

julia> uv = UniqueVector{String}()
0-element UniqueVector{String}

julia> for i in 1:1000
           push!(uv, "foo_$i")
       end;

julia> is = Indices{String}()
0-element Indices{String}

julia> for i in 1:1000
           insert!(is, "foo_$i")
       end;

julia> @btime "foo_500" in $uv
  13.230 ns (0 allocations: 0 bytes)
true

julia> @btime "foo_500" in $is
  12.747 ns (0 allocations: 0 bytes)
true

julia> @btime findfirst(isequal("foo_500"), $uv)
  14.918 ns (0 allocations: 0 bytes)
500

julia> @btime gettoken($is, "foo_500")[2][2]
  12.546 ns (0 allocations: 0 bytes)
500

julia> Base.summarysize(uv)
92765

julia> Base.summarysize(is)
63813

so they seem pretty similar performance-wise, but Indices seems to have the edge, also Indices seems more memory optimized

2 Likes

Thanks a lot! Indices from Dictionaries.jl indeed seems to be exactly what I need.
For the sake of completeness, here is a comparison of how to perform linear indexing for both data structures:

julia> @btime $uv[500]
  1.761 ns (0 allocations: 0 bytes)
"foo_500"

julia> @btime gettokenvalue($is, 500)
  2.021 ns (0 allocations: 0 bytes)
"foo_500"

The only concern I have is that “tokens” seem to be part of the interface describing how to implement new subtypes of AbstractIndices, rather than the public interface of Indices{T}. By which I mean that the docs don’t give me the impression that it would be guaranteed that tokentype(Indices{String}()) == Int or that the token values are contiguous integers starting from 1. But I’m not expecting that to change either, so it should be fine in practice.

1 Like

The token values are only contiguous if there are no holes (e.g. because you never delete items).

1 Like