Dictionaries.jl token API

Thanks @uniment. You raise a number of interesting points. I’ll need to keep my reply brief for now, so I’ll just reply to each in turn.

  1. Is there a reason the gettoken function returns a tuple containing whether the key is present? Why not just return a guaranteed-invalid token, which isassigned (or whatever token-checking function) can immediately recognize as invalid, and skip the boolean?

So currently the type of the token is up to the container. Users should be able to create new dictionaries which don’t have integer tokens, and I can’t predict what they are because they might be wrapping some database or distributed system that I’ve never heard of. So e.g. negative integers aren’t good for a generic interface.

From memory, using Union{Nothing, Int} for Dictionary (well, Indices) here was a bit slower in practice with the version of Julia originally used (maybe 1.0). Not sure if that still holds.
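To make the tuple-returning shape concrete, here’s a rough sketch using the current Dictionaries.jl functions (gettoken and gettokenvalue) — bearing in mind the exact API may drift:

```julia
using Dictionaries

d = Dictionary(["a", "b"], [1, 2])

# gettoken returns (hasindex::Bool, token). The token type is opaque and
# container-specific, so there is no universal "invalid token" sentinel value
# that a generic interface could rely on - hence the explicit Bool.
(hasindex, token) = gettoken(d, "a")
if hasindex
    gettokenvalue(d, token)  # fast re-access without re-hashing the key
end
```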

  1. Is there a reason the token is just a tuple? I get that that’s the minimal amount of information needed to communicate the dictionary access, but it could be nice to have a specialized DictToken type—then we could type-specialize getindex and setindex! on it, so we wouldn’t need special gettokenvalue and settokenvalue! functions and we could leverage [] index syntax. (I don’t see any reason that a dictionary must maintain the ability to hash DictTokens into keys; that seems like degenerate behavior :sweat_smile: not everything that’s hashable ought to be hashed). This is basically the hashmap dual of the idea to access arrays by ordinal indices, which was what inspired this thread.
  2. There seems to be something I don’t understand about tokens: why is t ∉ tokens(d)? And why does iterating tokens(d) yield something so different from what’s shown by its show method?

OK, so I was dithering over whether the token for Indices should be one integer or two, and it’s a bit confusing that it currently has two “flavours” of token. In some circumstances you might want a token capable of deletion (like for unset!, faster with two integers) and in other circumstances you may not (e.g. iteration, faster with one integer), and as a result the current implementation is a bit ugly.

As for having a token struct type (like we have CartesianIndex etc.), that’s another design choice. For simplicity I tend to go for a “structural” typing style, but that’s not a necessary choice. One design choice being explored in Dictionaries.jl is explicitly separating the interface for indexing via indices from the interface for indexing via tokens, as well as ensuring dictionaries can have keys of every type (e.g. I can create an Indices{TokenType}, and Indices{Any} works the same way for every key value).

  1. It seems like a good way to shoot ourselves in the foot is by keeping an old token and attempting to use it after a dictionary has had an insertion or deletion. Is there any zero-cost way to store information in the DictToken that could be used to indicate whether it is still valid, so that attempting to use an outdated token will throw an error?

Correct, you need to be very careful keeping tokens. The mental model is that they are invalidated whenever the indices are mutated. I don’t know of a zero-cost way to check this; perhaps using the age parameter from Base.Dict would help.
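One direction an age-based check could take — purely a hypothetical sketch, none of these names exist in Dictionaries.jl — is to stamp each token with a mutation counter that the container bumps on every insertion or deletion:

```julia
# Hypothetical sketch only - not part of Dictionaries.jl.
# The container tracks an "age" that is incremented on every structural
# mutation; tokens carry the age at which they were issued.
mutable struct AgedContainer{C}
    container::C
    age::Int
end

struct AgedToken{Tok}
    token::Tok
    age::Int
end

function checkedtokenvalue(d::AgedContainer, t::AgedToken)
    t.age == d.age || throw(ArgumentError("token invalidated by a mutation"))
    # ... delegate to the underlying token lookup here
end
```

The check itself is a single integer comparison, though it isn’t literally zero-cost (tokens get wider, and every mutation must touch the counter).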

Note that the situation is the same with Vector - if you use an old index after changing the size of the vector you can get a bounds error. Combined with @inbounds you can get segfaults. The end user doesn’t even need an @inbounds annotation for that to occur, they could just mutate the vector in a closure passed to a higher-order function, for example (higher-order functions like map won’t expect the closure to have side-effects on the input).
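For example, the plain-Vector version of the hazard:

```julia
v = [1, 2, 3]
i = lastindex(v)      # i == 3
pop!(v)               # v is now [1, 2]; the saved index is stale
# v[i]                # throws BoundsError
# @inbounds v[i]      # undefined behavior - may read garbage or segfault
```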

  1. Is there a better way to check the validity of tokens that I’m completely missing?

If ever in doubt, get a new token with gettoken.

  1. For ordered dictionaries, would it be meaningful to be able to construct token ranges? For example, I’m imagining that one could write d[token1:token2] and access all elements of the dictionary between those tokens.

I’d love something for that. One thing I need to find the time to explore is dictionaries that use sort-order for lookup (things like B-trees, red-black trees, LSM trees, etc). Then taking key-slices is like taking token-slices. Note that the tokens may be sparse, and you’d want something more like gettokens(d, token1..token2) (where .. is from Intervals.jl). I haven’t figured all that out yet, but you can implement lots of interesting algorithms with sort-based indexing and range-based lookups (it’s the majority of what powers SQL DBs, for example).
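To sketch the idea (purely speculative — gettokens and .. follow the notation above, and nothing here is implemented):

```julia
# Purely speculative - sketching the interface, not an implementation.
# For a sort-ordered container, a token slice corresponds to a key slice:
#
#   tok1 = last(gettoken(d, key1))
#   tok2 = last(gettoken(d, key2))
#   slice = gettokens(d, tok1 .. tok2)   # all entries between the two tokens
#
# A B-tree-style container would walk the leaves between the two positions,
# so such a range lookup could cost O(log n + k) for k results.
```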

Keep in mind I consider all the current token API implementation details to be in flux until a few more dictionary types (like those using key’s sort-order internally) are implemented and tested against the interface.

Finally, I’d like your opinion if you don’t mind:

  1. It seems like ordered dictionaries offer a lot of benefits. Would it make sense to introduce an ordered dictionary into Base?

People’s opinions seem to differ. Python did it around version 3.6 (guaranteed from 3.7), with the argument that it reduces the chance of bugs in users’ code. Personally, I am sympathetic to that argument.

  1. If a scheme for ordinal indexing of abstract arrays is introduced, would you think it a good idea to also use it for ordered dictionaries?

Possibly, yes. Note that ordinal lookup on Dictionary is also slow if there are any holes (deletions), so it may fall back to iterating at run-time and that may surprise users (e.g. tests & benchmarks look fast, but production code runs at O(N^2)…).
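To sketch why: once there are holes, the n-th entry no longer lives in the n-th slot, so a generic fallback has little choice but to iterate (nth_value below is a hypothetical helper, not part of any package):

```julia
# Hypothetical fallback: find the n-th element of a dictionary by iterating.
# Each lookup is O(n), so looping over all ordinals this way is O(n^2).
function nth_value(d, n)
    for (i, v) in enumerate(d)
        i == n && return v
    end
    throw(BoundsError(d, n))
end
```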

Thanks for your interest.
