[ANN] DIctionaries.jl 0.3.0 - now using ordered collections by default

I’d like to announce the release of Dictionaries.jl v0.3.0.

The headline changes are that dictionaries and indices now have a well-defined order (just like in OrderedCollections.jl) and that HashDictionary has been renamed to Dictionary for brevity and to better match Base.

The creation of a brand new hash-based dictionary has been a tonne of work, and I’m glad to be finally releasing it. It’s design is inspired by the ordered hash dictionary released in CPython 3.6. I believe it is well tested but if you come across any problems please raise an issue.

Strongly ordered indexable containers

When developing for dictionaries (and their indices) it has been a bit of an open question (to me) whether the iteration order is semantically important. For example, if indices are like sets, should they use a set-based equality?

I have settled on making the order of a container an important feature. When new elements are added via insert! (or merge, union, set!, etc) they are always appended to the back. Comparison via == now also relies on the order being the same, so that Indices([1,2]) != Indices([2,1]). One can use issetequal and the newly created isdictequal functions to check for equality ignoring ordering.

This is important for many reasons and will enable future work. In the new version, map and filter is guaranteed to preserve the ordering. Reasoning about your code should be easier (there is less “undefined behavior”). Dictionaries will be able to be serialized and deserialized deterministically (and remote nodes will be able to agree on ordering semantics). We will be able to implement reverse and sort and permute! on dictionaries, and split dictionaries in half, and support a whole range of other useful operations and optimizations in the future. I am particularly excited about using this style of dictionary in the context of tables (e.g. the columns of a DataFrame are an ordered dictionary which shares it’s indices with the rows, and we also can think of tables with a primary key where the row indices are not 1:n but the values of the primary key itself).

As it stands, performance of bulk “analytical” style operations has improved signficantly compared to Dictionaries 0.2, to the point that the speed of map on Dictionaries now typically matches Vector! (Yay!) The trade off is that the new container is a little slower at random access, insertion and deletion, I suspect due to the additional redirection. I would hope this can be improved in the future but I do not know.

Renaming

HashDictionary is now Dictionary and HashIndices is now Indices. Not only is this less typing (phew!) but also it also more closely resembles Base. The old Indices and Dictionary have been tweaked and renamed to ArrayIndices and ArrayDictionary respectively, to better indicate their nature. I wanted to make this breaking change as early as possible (sorry if this inconveniences anyone).

New features

Along with the aforementioned isdictequal predicate function, there are new functions disjoint, distinct, index, and improvements to constructors and dictionary.

The disjoint function is a predicate (returning true or false) that detects if there is any overlap between two collections according to isequal. The best thing is that it is generic and not just for dictionaries!

julia> disjoint([1,2],[2,3])
false

julia> disjoint([1,2],[3,4])
true

The distinct function is like unique except it returns an Indices instead of a Vector. Internally, the unique function in base constructs a Set and then throws away the result. However, the expensive-to-create hashmap can be useful in future operations, and there is no reason to throw it away now that the values inside Indices are stored as a dense Vector.

julia> distinct([1,2,3,3])
3-element Indices{Int64}
 1
 2
 3

The index(f, iterable) function is an analogue to unique(f, iterable), which passes elements through the “key function” f to determine their uniqueness. Such an operation can be used to simply accelerate the uniqueness operation (if the values are complex fields containing a simple identifier or rows of a table containing a primary key column) or can be used to reduce datasets in some way. The resulting dictionary (and it’s indices) can be used into the future.

julia> index(first, ["Alice", "Bob", "Charlie"])
3-element Dictionary{Char,String}
 'A' │ "Alice"
 'B' │ "Bob"
 'C' │ "Charlie"

julia> map(uppercase, ans)
3-element Dictionary{Char,String}
 'A' │ "ALICE"
 'B' │ "BOB"
 'C' │ "CHARLIE"

Minor improvements and bugfixes

I won’t list them all, but now everything is much more consistent and thoroughly checked and hopefully easier to use. The constructors for indices and dictionaries do not allow for repeated indices. The dictionary function can now be used with 2-tuples from zip rather than Pairs, for example:

julia> dictionary(zip(["a","b","c"], [1,2,3]))
3-element Dictionary{String,Int64}
 "a" │ 1
 "b" │ 2
 "c" │ 3

The single argument constructors now accept arbitrary indexable containers, copying their keys and values (including Dict from Base). Coincidentally, you can now convert a dict::Dict to a Dictionary through either the Dictionary(dict) constructor or the dictionary(dict) function.

As always, I appreciate any feedback or ideas from the community - and I hope this might be useful for some of you!

28 Likes

wow! this will be amazing. Huge step forward! Looking forward to the wake of improvements in all the packages that will depend on this. Awesome work!

2 Likes

Excellent work!

Doesn’t the equality predicate depend on the collection type? Because disjoint([NaN], [NaN]) returns true :slight_smile:
Also, given that Base.isdisjoint now exists, it’s probably worth either merging disjoint with it or documenting explicitly the difference.

3 Likes

Yes, it depends on the behavior of in. For AbstractSet and AbstractIndices, where I recommend you do your set logic, it should be a isequal check.

Ooh shiny! Thank you for pointing that out - I haven’t been using development versions of Julia very much lately. Dictionaries v0.3.1 will deprecate disjoint in favour of Base.isdisjoint on Julia 1.5 onwards.

5 Likes

Is there an UnorderedDictionary somewhere that iterates quickly?

An unordered dictionary generally lacks the indirection between the hash table and a dense array that is fast to iterate over, so I’m not aware how you’d have both.

The advantage of not having the layer of indirection is that insertion/deletion/lookup is ever-so-slightly faster. The improvement in iteration speed with the indirection seems to be much larger than the slowdown in these operations, though, so it’s seems like an OK trade off to me.

As fas as I’m aware you generally need to choose one or the other, depending on your use case.

That said - note that in Dictionaries.jl in the contrib directory there is a perfectly functional unordered dictionary like Dict that supports the (much) faster co-iteration, map, similar, etc that you get from the Dictionaries.jl interface. I hope to promote it to the package proper but haven’t figured out the traits etc to make generic code understanding the dictionary properties possible.

3 Likes