I’d like to announce the release of Dictionaries.jl v0.3.0.
The headline changes are that dictionaries and indices now have a well-defined order (just like in OrderedCollections.jl) and that HashDictionary
has been renamed to Dictionary
for brevity and to better match Base
.
The creation of a brand new hash-based dictionary has been a tonne of work, and I’m glad to be finally releasing it. It’s design is inspired by the ordered hash dictionary released in CPython 3.6. I believe it is well tested but if you come across any problems please raise an issue.
Strongly ordered indexable containers
When developing for dictionaries (and their indices) it has been a bit of an open question (to me) whether the iteration order is semantically important. For example, if indices are like sets, should they use a set-based equality?
I have settled on making the order of a container an important feature. When new elements are added via insert!
(or merge
, union
, set!
, etc) they are always appended to the back. Comparison via ==
now also relies on the order being the same, so that Indices([1,2]) != Indices([2,1])
. One can use issetequal
and the newly created isdictequal
functions to check for equality ignoring ordering.
This is important for many reasons and will enable future work. In the new version, map
and filter
is guaranteed to preserve the ordering. Reasoning about your code should be easier (there is less “undefined behavior”). Dictionaries will be able to be serialized and deserialized deterministically (and remote nodes will be able to agree on ordering semantics). We will be able to implement reverse
and sort
and permute!
on dictionaries, and split dictionaries in half, and support a whole range of other useful operations and optimizations in the future. I am particularly excited about using this style of dictionary in the context of tables (e.g. the columns of a DataFrame
are an ordered dictionary which shares it’s indices with the rows, and we also can think of tables with a primary key where the row indices are not 1:n but the values of the primary key itself).
As it stands, performance of bulk “analytical” style operations has improved signficantly compared to Dictionaries 0.2, to the point that the speed of map
on Dictionaries
now typically matches Vector
! (Yay!) The trade off is that the new container is a little slower at random access, insertion and deletion, I suspect due to the additional redirection. I would hope this can be improved in the future but I do not know.
Renaming
HashDictionary
is now Dictionary
and HashIndices
is now Indices
. Not only is this less typing (phew!) but also it also more closely resembles Base
. The old Indices
and Dictionary
have been tweaked and renamed to ArrayIndices
and ArrayDictionary
respectively, to better indicate their nature. I wanted to make this breaking change as early as possible (sorry if this inconveniences anyone).
New features
Along with the aforementioned isdictequal
predicate function, there are new functions disjoint
, distinct
, index
, and improvements to constructors and dictionary
.
The disjoint
function is a predicate (returning true
or false
) that detects if there is any overlap between two collections according to isequal
. The best thing is that it is generic and not just for dictionaries!
julia> disjoint([1,2],[2,3])
false
julia> disjoint([1,2],[3,4])
true
The distinct
function is like unique
except it returns an Indices
instead of a Vector
. Internally, the unique
function in base constructs a Set
and then throws away the result. However, the expensive-to-create hashmap can be useful in future operations, and there is no reason to throw it away now that the values inside Indices
are stored as a dense Vector
.
julia> distinct([1,2,3,3])
3-element Indices{Int64}
1
2
3
The index(f, iterable)
function is an analogue to unique(f, iterable)
, which passes elements through the “key function” f
to determine their uniqueness. Such an operation can be used to simply accelerate the uniqueness operation (if the values are complex fields containing a simple identifier or rows of a table containing a primary key column) or can be used to reduce datasets in some way. The resulting dictionary (and it’s indices) can be used into the future.
julia> index(first, ["Alice", "Bob", "Charlie"])
3-element Dictionary{Char,String}
'A' │ "Alice"
'B' │ "Bob"
'C' │ "Charlie"
julia> map(uppercase, ans)
3-element Dictionary{Char,String}
'A' │ "ALICE"
'B' │ "BOB"
'C' │ "CHARLIE"
Minor improvements and bugfixes
I won’t list them all, but now everything is much more consistent and thoroughly checked and hopefully easier to use. The constructors for indices and dictionaries do not allow for repeated indices. The dictionary
function can now be used with 2-tuples from zip
rather than Pair
s, for example:
julia> dictionary(zip(["a","b","c"], [1,2,3]))
3-element Dictionary{String,Int64}
"a" │ 1
"b" │ 2
"c" │ 3
The single argument constructors now accept arbitrary indexable containers, copying their keys and values (including Dict
from Base
). Coincidentally, you can now convert a dict::Dict
to a Dictionary
through either the Dictionary(dict)
constructor or the dictionary(dict)
function.
As always, I appreciate any feedback or ideas from the community - and I hope this might be useful for some of you!