Indexable set data structure

I’m looking for a data structure that I’m sure must have already been invented and implemented, but which I can’t find (probably because I don’t know its canonical name).

The data structure I have in mind shares some properties with both a Set/Dict and a Vector:

  • it only stores unique values (like a Set)
  • values can be efficiently accessed by a linear index (like a Vector)
  • given a value, it should be fast (by which I mean something like \mathcal{O}(1) complexity) to check whether it is in the set, and if so retrieve its index.

To illustrate what I mean, the following is a bare-bone (probably not optimal) implementation providing all features I need:

struct IndexedSet{T}
    idx :: Dict{T, Int}
    val :: Vector{T}
    IndexedSet{T}() where {T} = new{T}(Dict(), [])
end

function Base.push!(is::IndexedSet, val)
    get!(is.idx, val) do
        push!(is.val, val)
        lastindex(is.val)
    end
end

Base.iterate(is::IndexedSet, args...) = iterate(is.val, args...)
Base.in(val, is::IndexedSet) = val ∈ keys(is.idx)
Base.length(is::IndexedSet) = length(is.val)
Base.getindex(is::IndexedSet, idx) = is.val[idx]
indexof(is::IndexedSet, val) = is.idx[val]
# Tests / usage example
is = IndexedSet{String}()
push!(is, "foo")
push!(is, "bar")
push!(is, "baz")

push!(is, "foo") # no-op, since "foo" is already in the set
@assert length(is) == 3

# These should all be fast
@assert "bar" ∈ is
@assert is[3] == "baz"
@assert indexof(is, "baz") == 3

Isn’t this a classical data structure? Does it have a name?

I looked in DataStructures.jl and found OrderedSet or SortedSet, which come close but don’t provide everything I need (in particular no way to efficiently retrieve the index of a value as far as I could see).

julia> using DataStructures, BenchmarkTools

julia> is = OrderedSet{String}()
OrderedSet{String}()

julia> push!(is, "foo")
OrderedSet{String} with 1 element:
  "foo"

julia> push!(is, "bar")
OrderedSet{String} with 2 elements:
  "foo"
  "bar"

julia> push!(is, "baz")
OrderedSet{String} with 3 elements:
  "foo"
  "bar"
  "baz"

julia> push!(is, "foo")
OrderedSet{String} with 3 elements:
  "foo"
  "bar"
  "baz"

julia> length(is) == 3
true

julia> @btime "bar" ∈ $is
  16.216 ns (0 allocations: 0 bytes)
true

julia> @btime $is[3] == "baz"
  12.513 ns (0 allocations: 0 bytes)
true

julia> @btime findfirst(==("baz"), $is) == 3
  26.104 ns (0 allocations: 0 bytes)
true

OrderedSet seems pretty fast to me.

Thanks, but the search seems to have linear complexity: it’s only fast when the OrderedSet contains few elements.

julia> os = OrderedSet{String}()
OrderedSet{String}()

julia> for i in 1:1000
           push!(os, "foo_$i")
       end

julia> @btime findfirst(==("foo_500"), $os)
  1.565 μs (0 allocations: 0 bytes)
500

This is why I think “fast” in this context should mean “constant complexity”. For example, contrast the OrderedSet performance with an IndexedSet as implemented above:

julia> is = IndexedSet{String}()
IndexedSet{String}(Dict{String, Int64}(), String[])

julia> for i in 1:1000
           push!(is, "foo_$i")
       end

julia> @btime indexof($is, "foo_500")
  12.531 ns (0 allocations: 0 bytes)
500

ok, my previous response was not a good one, but I guess it was on a good track because maybe instead Indices{T} from Dictionaries.jl seems suitable:

julia> indset = Indices([1,2,3])
3-element Indices{Int64}
 1
 2
 3

julia> insert!(indset, 5)
4-element Indices{Int64}
 1
 2
 3
 5

julia> indset.values
4-element Vector{Int64}:
 1
 2
 3
 5

julia> 1 in indset
true

on your benchmark

julia> is = Indices{String}()
0-element Indices{String}

julia> for i in 1:1000
                  insert!(is, "foo_$i")
              end

julia> using BenchmarkTools

julia> @btime "foo_500" in $is
  12.566 ns (0 allocations: 0 bytes)
true

Edit: actually probably a Dictionary{T, Int} could be better to be able to create a indexof function


Edit2: actually for Indices there is gettoken

This looks like a nice data structure!

I just noticed that your push! function does not return the original container you are pushing to, but one of its fields. Thus may lead to unexpected behavior.

Maybe:

julia> using UniqueVectors

julia> uv = UniqueVector(("foo","bar","baz"))
3-element UniqueVector{String}:
 "foo"
 "bar"
 "baz"

julia> dump(uv)
UniqueVector{String}
  items: Array{String}((3,))
    1: String "foo"
    2: String "bar"
    3: String "baz"
  lookup: Dict{String, Int64}
    slots: Array{UInt8}((16,)) UInt8[0x00, 0x00, 0xb5, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xae, 0x00, 0x00, 0x00, 0x00, 0xe0]
    keys: Array{String}((16,))
      1: #undef
      2: #undef
      3: String "bar"

julia> for i in 1:1000
                  push!(uv, "foo_$i")
              end;

julia> @btime "foo_500" in $uv
  min 25.669 ns, mean 26.122 ns (0 allocations)
true

julia> @btime "foo_500" in $(collect(uv)::Vector)
  min 4.107 μs, mean 4.634 μs (0 allocations)
true

julia> @btime findfirst(isequal("foo_500"), $uv)
  min 27.679 ns, mean 28.534 ns (0 allocations)
503

A comparison between Indices and UniqueVector on my machine:

julia> using Dictionaries, UniqueVectors, BenchmarkTools

julia> uv = UniqueVector{String}()
0-element UniqueVector{String}

julia> for i in 1:1000
           push!(uv, "foo_$i")
       end;

julia> is = Indices{String}()
0-element Indices{String}

julia> for i in 1:1000
           insert!(is, "foo_$i")
       end;

julia> @btime "foo_500" in $uv
  13.230 ns (0 allocations: 0 bytes)
true

julia> @btime "foo_500" in $is
  12.747 ns (0 allocations: 0 bytes)
true

julia> @btime findfirst(isequal("foo_500"), $uv)
  14.918 ns (0 allocations: 0 bytes)
500

julia> @btime gettoken($is, "foo_500")[2][2]
  12.546 ns (0 allocations: 0 bytes)
500

julia> Base.summarysize(uv)
92765

julia> Base.summarysize(is)
63813

so they seem pretty similar performance-wise, but Indices seems to have the edge, also Indices seems more memory optimized

Thanks a lot! Indices from Dictionaries.jl indeed seems to be exactly what I need.
For the sake of completeness, here is a comparison of how to perform linear indexing for both data structures:

julia> @btime $uv[500]
  1.761 ns (0 allocations: 0 bytes)
"foo_500"

julia> @btime gettokenvalue($is, 500)
  2.021 ns (0 allocations: 0 bytes)
"foo_500"

The only concern I have is that “tokens” seem to be part of the interface describing how to implement new subtypes of AbstractIndices, rather than the public interface of Indices{T}. By which I mean that the docs don’t give me the impression that it would be guaranteed that tokentype(Indices{String}()) == Int or that the token values are contiguous integers starting from 1. But I’m not expecting that to change either, so it should be fine in practice.

The token values are only contiguous if there are no holes (e.g. because you never delete items).