Thanks @uniment. You raise a number of interesting points. I’ll need to keep my reply brief for now, so I’ll just reply to each in turn.
- Is there a reason the `gettoken` function returns a tuple containing whether the key is present? Why not just return a guaranteed-invalid token, which `isassigned` (or whatever token-checking function) can immediately recognize as invalid, and skip the boolean?
So currently the type of the token is up to the container. Users should be able to create new dictionaries which don’t have integer tokens, and I can’t predict what they are because they might be wrapping some database or distributed system that I’ve never heard of. So e.g. negative integers aren’t good for a generic interface.
From memory, using `Union{Nothing, Int}` for `Dictionary` (well, `Indices`) here was a bit slower in practice with the version of Julia originally used (maybe 1.0). Not sure if that still holds.
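For concreteness, here is a minimal sketch of the `(hasindex, token)` pattern using the Dictionaries.jl functions discussed here (`gettoken`, `gettokenvalue`, `settokenvalue!`); treat the exact signatures as in flux, per the caveat later in this post:

```julia
using Dictionaries  # assumes Dictionaries.jl is installed

d = Dictionary(["a", "b"], [1, 2])

# gettoken returns (hasindex::Bool, token). The token's type is up to the
# container, so the Bool is the generic way to signal "key not present".
hasindex, token = gettoken(d, "a")
if hasindex
    gettokenvalue(d, token)       # read through the token
    settokenvalue!(d, token, 10)  # write through the same token
end
```

A missing key gives `hasindex == false`, and the token should simply not be used in that case.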
- Is there a reason the token is just a tuple? I get that that’s the minimal amount of information needed to communicate the dictionary access, but it could be nice to have a specialized `DictToken` type; then we could type-specialize `getindex` and `setindex!` on it, so we wouldn’t need special `gettokenvalue` and `settokenvalue!` functions and we could leverage `[]` index syntax. (I don’t see any reason that a dictionary must maintain the ability to hash `DictToken`s into keys; that seems like degenerate behavior, as not everything that’s hashable ought to be hashed.) This is basically the hashmap dual of the idea to access arrays by ordinal indices, which was what inspired this thread.
- There seems to be something I don’t understand about `tokens`: why is `t ∉ tokens(d)`? And why does iterating `tokens(d)` yield something so different from what’s shown by its `show` method?
OK, so I was dithering over whether the token for `Indices` should be one integer or two, and it’s a bit confusing that it has two “flavours” of token. In some circumstances you might want a token capable of deletion (like for `unset!`, faster with two integers) and in other circumstances you may not (e.g. iteration, faster with one integer), and as a result the current implementation is a bit ugly.
As for having a token `struct` type (like we have `CartesianIndex`, etc.), that is another choice. For simplicity I tend to go for a “structural” typing style, but that’s not a necessary choice. One design choice being explored in Dictionaries.jl is explicitly separating the interface for indexing via indices vs. indexing via tokens, as well as ensuring dictionaries can have keys of every type (e.g. I can create an `Indices{TokenType}`, and `Indices{Any}` works the same way for every key value).
- It seems like a good way to shoot ourselves in the foot is by keeping an old token and attempting to use it after a dictionary has had an insertion or deletion. Is there any zero-cost way to store information in the `DictToken` that could be used to indicate whether it is still valid, so that attempting to use an outdated token will throw an error?
Correct, you need to be very careful keeping tokens. The mental model is that they are invalidated whenever the indices are mutated. I don’t know any way to check this; perhaps using the `age` parameter from `Base.Dict` would help.
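One hypothetical scheme (not Dictionaries.jl API; every name below is made up for illustration) is to stamp each token with a generation counter that the container bumps on every structural mutation, so a stale token is caught with one integer comparison:

```julia
# Hypothetical sketch of generation-checked tokens; this is what Base.Dict's
# internal `age` field does for its own iteration safety checks.
mutable struct VersionedDict{K,V}
    data::Dict{K,V}
    generation::Int   # bumped on every structural mutation
end
VersionedDict{K,V}() where {K,V} = VersionedDict(Dict{K,V}(), 0)

struct VersionedToken{K}
    key::K            # a real token would store a slot index instead
    generation::Int
end

function gettoken_v(d::VersionedDict, key)
    haskey(d.data, key) ? (true, VersionedToken(key, d.generation)) :
                          (false, VersionedToken(key, -1))
end

function gettokenvalue_v(d::VersionedDict, t::VersionedToken)
    t.generation == d.generation ||
        throw(ArgumentError("stale token: the dictionary was mutated"))
    return d.data[t.key]
end

function insert_v!(d::VersionedDict, key, value)
    d.data[key] = value
    d.generation += 1   # invalidates all outstanding tokens
    return d
end
```

Note this is not quite zero-cost: the token grows by one integer and every access pays a comparison and branch.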
Note that the situation is the same with `Vector`: if you use an old index after changing the size of the vector you can get a bounds error. Combined with `@inbounds` you can get segfaults. The end user doesn’t even need an `@inbounds` annotation for that to occur; they could just mutate the vector in a closure passed to a higher-order function, for example (higher-order functions like `map` won’t expect the closure to have side effects on the input).
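The `Vector` version of the hazard, in plain Base Julia:

```julia
v = [1, 2, 3]
i = lastindex(v)   # i == 3: a valid index right now
pop!(v)            # structural mutation; i now points past the end
# v[i] throws a BoundsError here; under @inbounds it could instead read
# out-of-bounds memory and segfault.
```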
- Is there a better way to check the validity of tokens that I’m completely missing?
If ever in doubt, get a new token with `gettoken`.
- For ordered dictionaries, would it be meaningful to be able to construct token ranges? For example, I’m imagining that one could write `d[token1:token2]` and access all elements of the dictionary between those tokens.
I’d love something like that. One thing I need to find the time to explore is dictionaries that use sort-order for lookup (things like B-trees, red-black trees, LSM trees, etc.). Then taking key-slices is like taking token-slices. Note that the tokens may be sparse, and you’d want something more like `gettokens(d, token1..token2)` (where `..` is from Intervals.jl). I haven’t figured all that out yet, but you can implement lots of interesting algorithms with sort-based indexing and range-based lookups (it’s the majority of what powers SQL databases, for example).
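As a rough illustration of a key-slice resolving to a token-slice (entirely hypothetical: a sorted `Vector` of pairs stands in for a B-tree, and “tokens” are just positions in the sort order; none of these names exist anywhere):

```julia
# Hypothetical sketch; not Dictionaries.jl API.
struct SortedPairs{K,V}
    pairs::Vector{Pair{K,V}}   # kept sorted by key
end

# Resolve a key interval [lo, hi] to a (possibly sparse) range of tokens
# via binary search over the sorted keys.
function keyslice_tokens(s::SortedPairs, lo, hi)
    ks = first.(s.pairs)
    return searchsortedfirst(ks, lo):searchsortedlast(ks, hi)
end

s = SortedPairs(["a" => 1, "c" => 3, "e" => 5])
keyslice_tokens(s, "b", "e")   # tokens 2:3, i.e. the entries for "c" and "e"
```

Note the slice endpoints ("b" here) need not be present as keys, which is why an interval type reads better than a token range.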
Keep in mind I consider all the current token API implementation details to be in flux until a few more dictionary types (like those using key’s sort-order internally) are implemented and tested against the interface.
Finally, I’d like your opinion if you don’t mind:
- It seems like ordered dictionaries offer a lot of benefits. Would it make sense to introduce an ordered dictionary into `Base`?
People’s opinions seem to differ. Python did it in version 3.6 or so, with the argument that it reduces the chance of bugs in users’ code. Personally, I am sympathetic to that argument.
- If a scheme for ordinal indexing of abstract arrays is introduced, would you think it a good idea to also use it for ordered dictionaries?
Possibly, yes. Note that ordinal lookup on `Dictionary` is also slow if there are any holes (deletions), so it may fall back to iterating at run-time, and that may surprise users (e.g. tests and benchmarks look fast, but production code runs at O(N^2)…).
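A toy model of why (illustrative only, not the actual `Dictionary` internals): once deletions leave holes in the backing slots, the nth entry can only be found by scanning, so each ordinal lookup is O(N) and touching every ordinal is O(N^2) in total:

```julia
# Slots after some deletions; `nothing` marks a hole.
slots = Union{Nothing,Int}[1, nothing, 3, nothing, 5]

# Ordinal lookup must skip holes, which is O(N) per query.
function nth_entry(slots, n)
    seen = 0
    for s in slots
        s === nothing && continue
        seen += 1
        seen == n && return s
    end
    throw(BoundsError(slots, n))
end
```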
Thanks for your interest.