# Indexable set data structure

I’m looking for a data structure that I’m sure must have already been invented and implemented, but which I can’t find (probably because I don’t know its canonical name).

The data structure I have in mind shares some properties with both a Set/Dict and a Vector:

• it only stores unique values (like a Set)
• values can be efficiently accessed by a linear index (like a Vector)
• given a value, it should be fast (by which I mean something like \mathcal{O}(1) complexity) to check whether it is in the set, and if so retrieve its index.

To illustrate what I mean, the following is a bare-bone (probably not optimal) implementation providing all features I need:

struct IndexedSet{T}
idx :: Dict{T, Int}
val :: Vector{T}
IndexedSet{T}() where {T} = new{T}(Dict(), [])
end

function Base.push!(is::IndexedSet, val)
get!(is.idx, val) do
push!(is.val, val)
lastindex(is.val)
end
end

Base.iterate(is::IndexedSet, args...) = iterate(is.val, args...)
Base.in(val, is::IndexedSet) = val ∈ keys(is.idx)
Base.length(is::IndexedSet) = length(is.val)
Base.getindex(is::IndexedSet, idx) = is.val[idx]
indexof(is::IndexedSet, val) = is.idx[val]

# Tests / usage example
is = IndexedSet{String}()
push!(is, "foo")
push!(is, "bar")
push!(is, "baz")

push!(is, "foo") # no-op, since "foo" is already in the set
@assert length(is) == 3

# These should all be fast
@assert "bar" ∈ is
@assert is[3] == "baz"
@assert indexof(is, "baz") == 3


Isn’t this a classical data structure? Does it have a name?

I looked in DataStructures.jl and found OrderedSet or SortedSet, which come close but don’t provide everything I need (in particular no way to efficiently retrieve the index of a value as far as I could see).

1 Like
julia> using DataStructures, BenchmarkTools

julia> is = OrderedSet{String}()
OrderedSet{String}()

julia> push!(is, "foo")
OrderedSet{String} with 1 element:
"foo"

julia> push!(is, "bar")
OrderedSet{String} with 2 elements:
"foo"
"bar"

julia> push!(is, "baz")
OrderedSet{String} with 3 elements:
"foo"
"bar"
"baz"

julia> push!(is, "foo")
OrderedSet{String} with 3 elements:
"foo"
"bar"
"baz"

julia> length(is) == 3
true

julia> @btime "bar" ∈ $is 16.216 ns (0 allocations: 0 bytes) true julia> @btime$is[3] == "baz"
12.513 ns (0 allocations: 0 bytes)
true

julia> @btime findfirst(==("baz"), $is) == 3 26.104 ns (0 allocations: 0 bytes) true  OrderedSet seems pretty fast to me. 1 Like Thanks, but the search seems to have linear complexity: it’s only fast when the OrderedSet contains few elements. julia> os = OrderedSet{String}() OrderedSet{String}() julia> for i in 1:1000 push!(os, "foo_$i")
end

julia> @btime findfirst(==("foo_500"), $os) 1.565 μs (0 allocations: 0 bytes) 500  This is why I think “fast” in this context should mean “constant complexity”. For example, contrast the OrderedSet performance with an IndexedSet as implemented above: julia> is = IndexedSet{String}() IndexedSet{String}(Dict{String, Int64}(), String[]) julia> for i in 1:1000 push!(is, "foo_$i")
end

julia> @btime indexof($is, "foo_500") 12.531 ns (0 allocations: 0 bytes) 500  2 Likes ok, my previous response was not a good one, but I guess it was on a good track because maybe instead Indices{T} from Dictionaries.jl seems suitable: julia> indset = Indices([1,2,3]) 3-element Indices{Int64} 1 2 3 julia> insert!(indset, 5) 4-element Indices{Int64} 1 2 3 5 julia> indset.values 4-element Vector{Int64}: 1 2 3 5 julia> 1 in indset true  on your benchmark julia> is = Indices{String}() 0-element Indices{String} julia> for i in 1:1000 insert!(is, "foo_$i")
end

julia> using BenchmarkTools

julia> @btime "foo_500" in $is 12.566 ns (0 allocations: 0 bytes) true  Edit: actually probably a Dictionary{T, Int} could be better to be able to create a indexof function Edit2: actually for Indices there is gettoken 1 Like This looks like a nice data structure! I just noticed that your push! function does not return the original container you are pushing to, but one of its fields. Thus may lead to unexpected behavior. 1 Like Maybe: julia> using UniqueVectors julia> uv = UniqueVector(("foo","bar","baz")) 3-element UniqueVector{String}: "foo" "bar" "baz" julia> dump(uv) UniqueVector{String} items: Array{String}((3,)) 1: String "foo" 2: String "bar" 3: String "baz" lookup: Dict{String, Int64} slots: Array{UInt8}((16,)) UInt8[0x00, 0x00, 0xb5, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xae, 0x00, 0x00, 0x00, 0x00, 0xe0] keys: Array{String}((16,)) 1: #undef 2: #undef 3: String "bar" julia> for i in 1:1000 push!(uv, "foo_$i")
end;

julia> @btime "foo_500" in $uv min 25.669 ns, mean 26.122 ns (0 allocations) true julia> @btime "foo_500" in$(collect(uv)::Vector)
min 4.107 μs, mean 4.634 μs (0 allocations)
true

julia> @btime findfirst(isequal("foo_500"), $uv) min 27.679 ns, mean 28.534 ns (0 allocations) 503  2 Likes A comparison between Indices and UniqueVector on my machine: julia> using Dictionaries, UniqueVectors, BenchmarkTools julia> uv = UniqueVector{String}() 0-element UniqueVector{String} julia> for i in 1:1000 push!(uv, "foo_$i")
end;

julia> is = Indices{String}()
0-element Indices{String}

julia> for i in 1:1000
insert!(is, "foo_$i") end; julia> @btime "foo_500" in$uv
13.230 ns (0 allocations: 0 bytes)
true

julia> @btime "foo_500" in $is 12.747 ns (0 allocations: 0 bytes) true julia> @btime findfirst(isequal("foo_500"),$uv)
14.918 ns (0 allocations: 0 bytes)
500

julia> @btime gettoken($is, "foo_500")[2][2] 12.546 ns (0 allocations: 0 bytes) 500 julia> Base.summarysize(uv) 92765 julia> Base.summarysize(is) 63813  so they seem pretty similar performance-wise, but Indices seems to have the edge, also Indices seems more memory optimized 2 Likes Thanks a lot! Indices from Dictionaries.jl indeed seems to be exactly what I need. For the sake of completeness, here is a comparison of how to perform linear indexing for both data structures: julia> @btime$uv[500]
1.761 ns (0 allocations: 0 bytes)
"foo_500"

julia> @btime gettokenvalue(\$is, 500)
2.021 ns (0 allocations: 0 bytes)
"foo_500"


The only concern I have is that “tokens” seem to be part of the interface describing how to implement new subtypes of AbstractIndices, rather than the public interface of Indices{T}. By which I mean that the docs don’t give me the impression that it would be guaranteed that tokentype(Indices{String}()) == Int or that the token values are contiguous integers starting from 1. But I’m not expecting that to change either, so it should be fine in practice.

1 Like

The token values are only contiguous if there are no holes (e.g. because you never delete items).

1 Like