Dictionaries.jl token API

Reading the Dictionaries.jl readme, it looks like a really nice package that a lot of effort went into. I’m very glad to have something like this. Thank you for your contribution!

I’ve begun to play with it a bit to learn more about tokens, and I came away feeling that the interface could be more ergonomic.

Consider the following:

julia> using Dictionaries

julia> d = Dictionary('a':'c', 1:3)
3-element Dictionary{Char, Int64}
 'a' │ 1
 'b' │ 2
 'c' │ 3

julia> iskey, t = gettoken(d, 'b')
(true, (15, 2))

julia> iskey && gettokenvalue(d, t)
2

julia> iskey && settokenvalue!(d, t, 5)
3-element Dictionary{Char, Int64}
 'a' │ 1
 'b' │ 5
 'c' │ 3

julia> t ∈ tokens(d) # what?
false

julia> d[t] # this should work
ERROR: MethodError: Cannot `convert` an object of type Tuple{Int64, Int64} to an object of type Char

Questions:

  1. Is there a reason the gettoken function returns a tuple containing whether the key is present? Why not just return a guaranteed-invalid token, which isassigned (or whatever token-checking function) can immediately recognize as invalid, and skip the boolean?
  2. Is there a reason the token is just a tuple? I get that that’s the minimal amount of information needed to communicate the dictionary access, but it could be nice to have a specialized DictToken type—then we could type-specialize getindex and setindex! on it, so we wouldn’t need special gettokenvalue and settokenvalue! functions and we could leverage [] index syntax. (I don’t see any reason that a dictionary must maintain the ability to hash DictTokens into keys; that seems like degenerate behavior :sweat_smile: not everything that’s hashable ought to be hashed). This is basically the hashmap dual of the idea to access arrays by ordinal indices, which was what inspired this thread.
  3. There seems to be something I don’t understand about tokens: why is t ∉ tokens(d)? And why does iterating tokens(d) yield something so different from what’s shown by its show method?
julia> t
(15, 2)

julia> tokens(d)
3-element Dictionaries.IndicesTokens{Char, Int64, Indices{Char}}
 'a' │ (8, 1)
 'b' │ (15, 2)
 'c' │ (9, 3)

julia> t ∈ tokens(d) # we can't use this to check token validity
false

julia> (tokens(d)...,) # what?
((0, 1), (0, 2), (0, 3))

Now, consider the following extension to the above example:

julia> _, tb = gettoken(d, 'b'); _, tc = gettoken(d, 'c');

julia> unset!(d, 'b')
2-element Dictionary{Char, Int64}
 'a' │ 1
 'c' │ 3

julia> istokenassigned(d, tb), istokenassigned(d, tc)
(true, false)

julia> set!(d, 'b', 2)
3-element Dictionary{Char, Int64}
 'a' │ 1
 'c' │ 3
 'b' │ 2

julia> istokenassigned(d, tb), istokenassigned(d, tc)
(true, true)

julia> gettokenvalue(d, tb), gettokenvalue(d, tc) # what?
(3, 2)

More questions:

  1. It seems like a good way to shoot ourselves in the foot is by keeping an old token and attempting to use it after a dictionary has had an insertion or deletion. Is there any zero-cost way to store information in the DictToken that could be used to indicate whether it is still valid, so that attempting to use an outdated token will throw an error?
  2. Is there a better way to check the validity of tokens that I’m completely missing?
  3. For ordered dictionaries, would it be meaningful to be able to construct token ranges? For example, I’m imagining that one could write d[token1:token2] and access all elements of the dictionary between those tokens.

I would imagine a more ergonomic interface to work like this:

#= simulation =#
d = Dictionary('a':'e', 1:5)
t = gettoken(d, 'c')
if isassigned(d, t)  d[t] = d[t] + 1  end

and possibly this for ordered dictionaries:

# currying form of gettoken? just spitballing...
gt = gettoken(d)
trange = gt('b'):gt('d')
isassigned(d, trange) && 
    map(d[trange]) do x; x^2 end

Finally, I’d like your opinion if you don’t mind:

  1. It seems like ordered dictionaries offer a lot of benefits. Would it make sense to introduce an ordered dictionary into Base?
  2. If a scheme for ordinal indexing of abstract arrays is introduced, would you think it a good idea to also use it for ordered dictionaries?

Thanks again!


Most of this seems like it should be a GitHub issue on the Dictionaries.jl repository.

Note that we also have an implementation of ordered dictionaries in OrderedCollections.jl. As for your final two questions:

  1. It would be best to keep a specific ordered dictionary implementation out of Base for now. It is much harder to iterate on the design if it is in Base. The main criterion I would consider for inclusion in Base is: does it have to be there? If it works well in a package without type piracy, then it probably should not be in Base.

  2. Test this design in a package. The ordinal indexing could go into its own package. Then you can create a series of composition packages where you combine ordinal indexing with other packages.


Thanks @uniment. You raise a number of interesting points. I’ll need to keep my reply brief for now, so I’ll just reply to each in turn.

  1. Is there a reason the gettoken function returns a tuple containing whether the key is present? Why not just return a guaranteed-invalid token, which isassigned (or whatever token-checking function) can immediately recognize as invalid, and skip the boolean?

So currently the type of the token is up to the container. Users should be able to create new dictionaries which don’t have integer tokens, and I can’t predict what they are because they might be wrapping some database or distributed system that I’ve never heard of. So e.g. negative integers aren’t good for a generic interface.

From memory, using Union{Nothing, Int} for Dictionary (well, Indices) here was a bit slower in practice with the version of Julia originally used (maybe 1.0). Not sure if that still holds.
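
For what it’s worth, a sentinel-style variant can be layered on top of the current API. Here is a minimal sketch (trytoken is a made-up name, not part of Dictionaries.jl):

using Dictionaries

# Sketch: wrap the (Bool, token) API in a sentinel-returning variant.
function trytoken(d, key)
    hasindex, t = gettoken(d, key)
    return hasindex ? t : nothing
end

d = Dictionary('a':'c', 1:3)
t = trytoken(d, 'b')
t === nothing || gettokenvalue(d, t)  # 2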

  2. Is there a reason the token is just a tuple? I get that that’s the minimal amount of information needed to communicate the dictionary access, but it could be nice to have a specialized DictToken type—then we could type-specialize getindex and setindex! on it, so we wouldn’t need special gettokenvalue and settokenvalue! functions and we could leverage [] index syntax. (I don’t see any reason that a dictionary must maintain the ability to hash DictTokens into keys; that seems like degenerate behavior :sweat_smile: not everything that’s hashable ought to be hashed). This is basically the hashmap dual of the idea to access arrays by ordinal indices, which was what inspired this thread.
  3. There seems to be something I don’t understand about tokens: why is t ∉ tokens(d)? And why does iterating tokens(d) yield something so different from what’s shown by its show method?

OK, so I was dithering over whether the token for Indices should be one integer or two, and it’s a bit confusing that it has two “flavours” of token. In some circumstances you want a token capable of deletion (as for unset!, which is faster with two integers), and in other circumstances you may not (e.g. iteration, which is faster with one integer), and as a result the current implementation is a bit ugly.
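
Concretely, in the first session above, the show method for tokens(d) displays the two-integer flavour while iterating it produces the one-integer flavour:

julia> gettoken(d, 'a')  # lookup returns the two-integer, deletion-capable flavour
(true, (8, 1))

julia> first(tokens(d))  # iteration yields the one-integer flavour
(0, 1)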

As for having a token struct type (like we have CartesianIndex etc.), that is another design choice. For simplicity I tend to go for a “structural” typing style, but that’s not a necessary choice. One design choice being explored in Dictionaries.jl is explicitly separating the interface for indexing via indices from indexing via tokens, as well as ensuring dictionaries can have keys of every type (e.g. I can create an Indices{TokenType}, and Indices{Any} works the same way for every key value).

  1. It seems like a good way to shoot ourselves in the foot is by keeping an old token and attempting to use it after a dictionary has had an insertion or deletion. Is there any zero-cost way to store information in the DictToken that could be used to indicate whether it is still valid, so that attempting to use an outdated token will throw an error?

Correct, you need to be very careful keeping tokens. The mental model is that they are invalidated whenever the indices are mutated. I don’t know of any zero-cost way to check this; perhaps something like the age counter that Base.Dict keeps internally would help.
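
For illustration, here is a minimal sketch of age-stamping layered on top of the existing API (all the names here are invented, not a real API):

using Dictionaries

mutable struct AgedDictionary{K,V}
    dict::Dictionary{K,V}
    age::UInt  # would need to be bumped on every insertion and deletion
end

struct AgedToken{T}
    token::T
    age::UInt
end

# Stamp the token with the dictionary's age at the time of lookup.
function agedtoken(ad::AgedDictionary, key)
    hasindex, t = gettoken(ad.dict, key)
    return hasindex ? AgedToken(t, ad.age) : nothing
end

# Refuse to dereference a token whose stamp no longer matches.
function agedtokenvalue(ad::AgedDictionary, at::AgedToken)
    at.age == ad.age || throw(ArgumentError("stale token: the dictionary has been mutated"))
    return gettokenvalue(ad.dict, at.token)
end

ad = AgedDictionary(Dictionary('a':'c', 1:3), UInt(0))
at = agedtoken(ad, 'b')
agedtokenvalue(ad, at)  # 2
ad.age += 1             # simulate an insertion/deletion
agedtokenvalue(ad, at)  # throws ArgumentError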

Note that the situation is the same with Vector: if you use an old index after changing the size of the vector, you can get a bounds error. Combined with @inbounds you can get segfaults. The end user doesn’t even need an @inbounds annotation for that to occur; they could just mutate the vector in a closure passed to a higher-order function, for example (higher-order functions like map won’t expect the closure to have side effects on the input).
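
The same footgun with a Vector:

v = [1, 2, 3]
i = lastindex(v)  # 3
pop!(v)
v[i]              # BoundsError: the saved index outlived the resize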

  2. Is there a better way to check the validity of tokens that I’m completely missing?

If ever in doubt, get a new token with gettoken.

  3. For ordered dictionaries, would it be meaningful to be able to construct token ranges? For example, I’m imagining that one could write d[token1:token2] and access all elements of the dictionary between those tokens.

I’d love to have something like that. One thing I need to find the time to explore is dictionaries that use sort order for lookup (B-trees, red-black trees, LSM trees, etc.). Then taking key-slices is like taking token-slices. Note that the tokens may be sparse, so you’d want something more like gettokens(d, token1..token2) (where .. is from Intervals.jl). I haven’t figured all that out yet, but you can implement lots of interesting algorithms with sort-based indexing and range-based lookups (it’s the majority of what powers SQL databases, for example).
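
As a rough sketch of why sort order makes range lookup cheap: on a sorted key vector, a key-slice is just two binary searches (plain Base Julia, not Dictionaries.jl code):

ks = ['a', 'c', 'f', 'k', 'p']   # sorted keys of a hypothetical sorted dictionary
lo = searchsortedfirst(ks, 'c')  # first position with key >= 'c'
hi = searchsortedlast(ks, 'k')   # last position with key <= 'k'
ks[lo:hi]                        # ['c', 'f', 'k']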

Keep in mind I consider all the current token API implementation details to be in flux until a few more dictionary types (like those using the keys’ sort order internally) are implemented and tested against the interface.

Finally, I’d like your opinion if you don’t mind:

  1. It seems like ordered dictionaries offer a lot of benefits. Would it make sense to introduce an ordered dictionary into Base?

People’s opinions seem to differ. Python did it in version 3.6 (at first as an implementation detail, guaranteed from 3.7), with the argument that it reduces the chance of bugs in users’ code. Personally, I am sympathetic to that argument.

  2. If a scheme for ordinal indexing of abstract arrays is introduced, would you think it a good idea to also use it for ordered dictionaries?

Possibly, yes. Note that ordinal lookup on Dictionary is also slow if there are any holes (deletions): it may fall back to iterating at run time, and that may surprise users (e.g. tests and benchmarks look fast, but production code runs at O(N^2)…).

Thanks for your interest.


Just to note that the package DataStructures.jl already has sorted dictionaries and a token interface for iterating over entries. It also supports some of the items mentioned in these posts, such as token1:token2 range indexing.
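
From memory (so do check the DataStructures.jl docs), the token-range API looks something like this:

using DataStructures

sd = SortedDict('a' => 1, 'b' => 2, 'c' => 3, 'd' => 4)
t1 = searchsortedfirst(sd, 'b')      # semitoken for 'b'
t2 = searchsortedfirst(sd, 'd')      # semitoken for 'd'
for (k, v) in inclusive(sd, t1, t2)  # entries from 'b' through 'd'
    println(k, " => ", v)
end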


Yes, this makes sense. My thinking was that, supposing for example (0,0) is the guaranteed-invalid token, a token could be checked with an egal test tpl === (0,0). The thought was that egal comparisons on a bitstype like this should be just as performant as using the boolean currently returned by gettoken; I believe this holds as long as we’re not constructing a large collection of tokens (and becoming bottlenecked by memory access). My understanding of the use cases for tokens suggests this is OK.

This is okay I think, so long as the interface for an AbstractDictionaryToken is defined to permit this.

For example, we might specify that Base.isvalid(t::AbstractDictionaryToken) be a method of its interface. For a token implementation struct MyToken <: AbstractDictionaryToken t::Tuple{Int,Int} end, the test might be Base.isvalid(t::MyToken) = t.t !== (0, 0), as sketched below.
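
Spelled out (AbstractDictionaryToken is hypothetical here, not a type Dictionaries.jl currently defines):

abstract type AbstractDictionaryToken end

struct MyToken <: AbstractDictionaryToken
    t::Tuple{Int,Int}
end

# (0, 0) plays the role of the guaranteed-invalid sentinel token
Base.isvalid(t::MyToken) = t.t !== (0, 0)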

Ah, that explains it. I was so confused :sweat_smile:

Making an AbstractDictionaryToken type would allow you to accommodate both, although I’m not sure the extra interface complexity is worthwhile.

Indexing via indices sounds like a perfect application of ordinal indices.

True, but I guess what’s different is that I never ask the vector for an object in order to access its elements, and if I want to check whether an index is valid I can query length or size (or eachindex or axes) and I know exactly how to do the math to check that it’s within those bounds. And even if it’s not idiomatic, I know that any time I access v[1] of a v::Vector, I will always get the element at the first position. Tokens, on the other hand, are a property of the dictionary and its keys in a non-obvious way; so if those change, it feels like there ought to be a way to invalidate the token (ideally without giving up on the reason we created the token: to avoid an access).

I’m mostly worried about the type of bug that doesn’t throw an error, which produces correct results for a long time until an unusual sequence of inputs causes the data to be incorrect. It’s nicer if we can identify that the token is invalid and throw an error upfront.

Depending on how expensive it is, the extra safety might or might not be worth it. It’s especially not worth it if it has nonzero cost and provides only an illusion of safety, in which case it’s better just to enforce safe habits and document well. But if there’s a low-cost way to add guardrails so I’m protected from myself, I find that attractive. (Conceptually, guardrails could be part of the specification, and less-safe collections could exist that are explicitly non-compliant for performance.)

Thanks again!

Thanks, I’ll check it out!