I’m looking for a data structure that I’m sure must have already been invented and implemented, but which I can’t find (probably because I don’t know its canonical name).
The data structure I have in mind shares some properties with both a Set/Dict and a Vector:
it only stores unique values (like a Set)
values can be efficiently accessed by a linear index (like a Vector)
given a value, it should be fast (by which I mean something like \mathcal{O}(1) complexity) to check whether it is in the set, and if so retrieve its index.
To illustrate what I mean, the following is a bare-bone (probably not optimal) implementation providing all features I need:
struct IndexedSet{T}
idx :: Dict{T, Int}
val :: Vector{T}
IndexedSet{T}() where {T} = new{T}(Dict(), [])
end
function Base.push!(is::IndexedSet, val)
get!(is.idx, val) do
push!(is.val, val)
lastindex(is.val)
end
end
Base.iterate(is::IndexedSet, args...) = iterate(is.val, args...)
Base.in(val, is::IndexedSet) = val ∈ keys(is.idx)
Base.length(is::IndexedSet) = length(is.val)
Base.getindex(is::IndexedSet, idx) = is.val[idx]
indexof(is::IndexedSet, val) = is.idx[val]
# Tests / usage example
is = IndexedSet{String}()
push!(is, "foo")
push!(is, "bar")
push!(is, "baz")
push!(is, "foo") # no-op, since "foo" is already in the set
@assert length(is) == 3
# These should all be fast
@assert "bar" ∈ is
@assert is[3] == "baz"
@assert indexof(is, "baz") == 3
Isn’t this a classical data structure? Does it have a name?
I looked in DataStructures.jl and found OrderedSet or SortedSet, which come close but don’t provide everything I need (in particular no way to efficiently retrieve the index of a value as far as I could see).
Thanks, but the search seems to have linear complexity: it’s only fast when the OrderedSet contains few elements.
julia> os = OrderedSet{String}()
OrderedSet{String}()
julia> for i in 1:1000
push!(os, "foo_$i")
end
julia> @btime findfirst(==("foo_500"), $os)
1.565 μs (0 allocations: 0 bytes)
500
This is why I think “fast” in this context should mean “constant complexity”. For example, contrast the OrderedSet performance with an IndexedSet as implemented above:
julia> is = IndexedSet{String}()
IndexedSet{String}(Dict{String, Int64}(), String[])
julia> for i in 1:1000
push!(is, "foo_$i")
end
julia> @btime indexof($is, "foo_500")
12.531 ns (0 allocations: 0 bytes)
500
julia> is = Indices{String}()
0-element Indices{String}
julia> for i in 1:1000
insert!(is, "foo_$i")
end
julia> using BenchmarkTools
julia> @btime "foo_500" in $is
12.566 ns (0 allocations: 0 bytes)
true
Edit: actually probably a Dictionary{T, Int} could be better to be able to create a indexof function
I just noticed that your push! function does not return the original container you are pushing to, but one of its fields. Thus may lead to unexpected behavior.
Thanks a lot! Indices from Dictionaries.jl indeed seems to be exactly what I need.
For the sake of completeness, here is a comparison of how to perform linear indexing for both data structures:
The only concern I have is that “tokens” seem to be part of the interface describing how to implement new subtypes of AbstractIndices, rather than the public interface of Indices{T}. By which I mean that the docs don’t give me the impression that it would be guaranteed that tokentype(Indices{String}()) == Int or that the token values are contiguous integers starting from 1. But I’m not expecting that to change either, so it should be fine in practice.