[ANN] Dictionaries.jl - Improved productivity and performance of dictionaries in Julia

andyferris · December 9, 2019, 10:42pm

I am pleased to announce the recent release of Dictionaries.jl, which provides an alternative interface for dictionaries in Julia, for improved productivity and performance. You can install it with ]add Dictionaries at the REPL.

The high-level goal of this package is to define a new interface for dictionary and set structures which is convenient and efficient for functional data manipulation - including operations such as non-scalar indexing, broadcasting, mapping, filtering, reducing, grouping, and so-on. While Julia comes with built-in AbstractDict and AbstractSet supertypes, the interfaces for these are not as well established or generic as for AbstractArray , the built-in dictionaries implement less of the common data manipulation operations compared to arrays, and it is difficult to work with them in a performant manner.

In this package we aim to devise a cohesive interface for abstract dictionaries (or associative maps), having the common supertype AbstractDictionary . A large part of this is working with indices (of arbitrary type) as well as convenient and efficient iteration of the containers. A second goal is to make dictionary manipulation more closely resemble array manipulation, to make it easier for users. Simultaneously, we are pushing the performance of working with dictionaries to be closer to that of working with arrays.

The largest departure from Base.AbstractDict that users should be aware of is that dictionaries here iterate values, not key-value pairs. In return, you get the convenience of using your usual tools for arrays, and some performance gains.

For a very small taste of how the Dictionaries.jl interface can lead to performance improvements during an analytic workflow, consider the following example where we attempt to add 1 to the values of a hash-based dictionary. Not only is this easier for the programmer (using broadcast notation) but is also much more performant than the typical pattern with Base.Dict. Of course, map, reduce, filter and friends all work out-of-the-box, too.

julia> using Dictionaries, BenchmarkTools

julia> dictionary = HashDictionary(rand(Int64, 10_000_000), rand(Int64, 10_000_000));

julia> @btime dictionary .+ 1;
  135.975 ms (22 allocations: 256.00 MiB)

julia> dict = Dict(rand(Int64) => rand(Int64) for _ in 1:10_000_000);

julia> @btime Dict(k => v+1 for (k,v) in dict);
  10.283 s (73 allocations: 541.17 MiB)

(Note - the thing impacting performance most here is that the Dict is constructed by inserting the elements and iteratively building a new hashmap, while broadcasting over the HashDictionary preserves the keys and hashmap, creating only new values in the valid slots).

This is a young package, and contributions are of course very welcome. I am currently working on integrating this with SplitApplyCombine.jl, particularly such that the group family of functions return a Dictionaries.jl dictionary.

For Julia Base developers, this package combines the ideas discussed during Julia 1.0 development, particulary the Juleps #24019 and #25904, the (unmerged) PRs #24508 and #25013, the non-scalar indexing proposal #30845 , the discussion of dictionary iteration in #22194 and a variety of other work which was merged at that time.

xiaodai · December 9, 2019, 11:06pm

What does that mean?

carstenbauer · December 9, 2019, 11:08pm

Plus one to all values. Did you read the text?

andyferris · December 9, 2019, 11:09pm

It broadcasts (maps) over the values but preserves the keys. Like this:

julia> d = HashDictionary(["a", "b", "c"], [1, 2, 3])
3-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1

julia> d .+ 1
3-element HashDictionary{String,Int64}
 "c" │ 4
 "b" │ 3
 "a" │ 2

mkborregaard · December 9, 2019, 11:09pm

afaiu it creates a new dictionary that has the same hashmap as the old one, just incremented values
EDIT sorry crosspost, but deleting now is annoying in discourse so just ignore

xiaodai · December 9, 2019, 11:23pm

missed it while I skimmed it

Ratingulate · December 10, 2019, 2:56am

Cool. There’s a tension right now between compile cruft and performance (DF vs indexed tables). Do you see this improving on that tradeoff?

How far does the performance go? Can this interface generalize arrays and serve as a named dim array if wrapped in an abstract array struct?

StevenSiew · December 10, 2019, 3:11am

I have exactly 1.3 questions to ask.

Is this Dictionary Thread Safe?

c42f · December 10, 2019, 4:30am

Thread safe in what way? If you’re looking for a dictionary which is thread safe for parallel insertions and deletions you won’t find it in Dictionaries.jl for the same reason it’s not in Base — the overhead would be too high for single threaded use (see also Can dicts be threadsafe? - #6 by StefanKarpinski)

andyferris · December 10, 2019, 4:41am

Is this Dictionary Thread Safe?

Good question!

The HashDictionary is implemented the same as (i.e. the code is copied from) Base.Dict, but without the allowances for WeakKeyDict. Neither mutating its keys nor reading from it, while some other thread mutates its keys, would be safe at all. (I.e. use the rust rules when sharing it).

It makes me wonder… the AbstractDictionary interface is pretty generic, but the token stuff doesn’t necessarily support atomic key-and-value updates - this probably bears some further thought! There’s nothing stopping you creating a non-tokenized dictionary with appropriate safeguards though.

andyferris · December 10, 2019, 4:50am

How far does the performance go? Can this interface generalize arrays and serve as a named dim array if wrapped in an abstract array struct?

I think so? Thinking about multidimensional dictionaries is something I’d like to do in the future. The interface should be pretty low overhead and ammenable to compiler optimizations/zero-cost abstractions, like AbstractArray. It’s possible that inlining immutable references to the heap will help a little bit (https://github.com/JuliaLang/julia/pull/33886).

Cool. There’s a tension right now between compile cruft and performance (DF vs indexed tables). Do you see this improving on that tradeoff?

So this is about say “strongly typed” tables vs “weakly typed” dataframes? This probably doesn’t contribute a whole lot to that… just like AbstractArray, AbstractDictionary has a type parameter for a uniform eltype. That said, I suspect you could create a “static dictionary” with inference working out fine, which would be an interesting experiment (possibly just need to unroll the loop for getindex on Dictionary containing tuples, and hope constant propagation works for the keys tuple… unfortunately I’m not sure what “guarantees” the compiler provides for constant propagation).

tim.holy · December 10, 2019, 1:48pm

Really nice to see this come to life!

pontus · December 10, 2019, 3:10pm

Not sure if you’ve seen the package DataCubes.jlwhich provides DictArray, a multidimensional array whose element type is an ordered dictionary with common keys, and LabeledArray, a multidimensional array consisting of base and axes?
It hasn’t been maintained for a couple of years, looks as if last update was for Julia v0.6. It is MIT Licensed, and perhaps will give some ideas for when you’re ready to move in that direction?

also has nice documentation:
http://c-s.github.io/DataCubes.jl/introduction

Thanks for your efforts with this package!

tkf · December 11, 2019, 9:44am

Thanks, this is a great work on re-designing a coherent interface of dictionary!

I find the design choice that dict[key] = value means strict update interesting… in the sense I am so used to it to mean “upsert” it’s a bit hard to imagine how reading/writing code would be like. But I think I get the analogy to arrays and consistency with other parts of the interface.

In Dictionaries with the same indices, you are mentioning that

The similar function is used to create a dictionary with defined indices, but undefined values that can be set/mutated after the fact. similar(dict, T) creates a container with the same indices as dict and, optionally, a new element type.

I suppose this is intentionally like this also for broadcasting and mapping?:

julia> d1 = HashDictionary(["a", "b", "c"], [1, 2, 3])
3-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1

julia> d2 = d1 .+ 1
3-element HashDictionary{String,Int64}
 "c" │ 4
 "b" │ 3
 "a" │ 2

julia> keys(d1) === keys(d2)
true

However, it feels undesirable property to me that insert!ing one of the dictionary mutates other ones that share the indices:

julia> insert!(d2, "d", 5)
4-element HashDictionary{String,Int64}
 "c" │ 4
 "b" │ 3
 "a" │ 2
 "d" │ 5

julia> d1
4-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1
 "d" │ 0

Is there a plan to “fix” this (or is this expected)? Maybe using copy-on-write indices? Immutable/persistent data structure?

andyferris · December 11, 2019, 1:14pm

I find the design choice that dict[key] = value means strict update interesting… in the sense I am so used to it to mean “upsert” it’s a bit hard to imagine how reading/writing code would be like. But I think I get the analogy to arrays and consistency with other parts of the interface.

Yes this one is quite interesting. I fretted over this one for a while, but I did use Dictionaries.jl for a little while for some “real” work before releasing, and honestly I never missed this. On the other hand, the feeling of certainty that the keys aren’t shifting under you while you are mutating values is quite comforting and very much seemed worth it to me (and set! is a nice partner to get!).

I suppose this is intentionally like this also for broadcasting and mapping?:

Yeah, generic functions delegate to similar and empty as factories to materialize new containers, so anything that preserves keys ends up going through similar. This is quite nice because I found I frequently ended up using fast co-iteration with the input and output, and also I rarely (if ever?) needed to mutate the keys of the output in practice.

Is there a plan to “fix” this (or is this expected)? Maybe using copy-on-write indices? Immutable/persistent data structure?

Yes, they are both valid options. My personal favorite is freeze from #31630 - I’d possibly have a freezekeys or freezeindices thing that you call first in order to get what is the current behavior, and revert similar on mutable keys to making a copy. We should fix this, for sure - but I feel this is the exact same problem with SubArray (mutating the indexer part can cause segfaults, since the bounds check on the indexer values are done at construction time, and the same goes for resizing the indexee) which we should fix first (and using freeze on the indexer and freezekeys on the indexee is only the way to achieve that which I have thought of, while I’m hoping the compiler can realize copying is a not needed in the majority of cases via escape analysis/rust-style borrow checker) . Currently as it stands, consider Dictionaries.jl to be a sharp tool.

tkf · December 12, 2019, 12:01am

Does it mean there would be (say) MutableIndices and FrozenIndices types? Reading your comments in #31630, I now can see that it’d work quite nicely with dict[key] = value being strict update; i.e., only dictionaries with MutableIndices support insert! while both types of dictionary support dict[key] = value.

andyferris · December 12, 2019, 2:56am

We could make our own types, yes. Or else wait for frozen arrays and freeze the arrays underlying the hashmap. I’m looking for feedback on this one, actually, so opinions welcome.

tkf · December 12, 2019, 4:00am

Skimming #31630, I thought you can just go ahead and define custom mutable and immutable indices types for now and then swap its internal based on Base.freeze/Base.melt, once #31630 is merged. (And also maybe add unsafe_freeze API to do “manual borrow checker” to avoid copying until #31630 arrives?) This, of course, relies on that the design of #31630 does not deviate much from the current version. But do you see pitfalls other than that?

Zach_Christensen · December 12, 2019, 11:09am

A while ago I started looking at improving indexing for some personal projects. The thing I got stuck on was the need to have distinct static and mutable ranges, which is why I developed https://github.com/Tokazama/StaticRanges.jl. I’ve also been trying to get the filtering/find/search stuff optimized for it too. This would only help with an index that can be conceptualized as a range, but I think that’s the most common use case (at least for me).

If you’re interested in using it I’d love to take a look at integrating it with Dictionaries.jl. It’s not fully optimized yet but I’ve included nearly all the tests on ranges in base and added many others to test some of the novel functionalities.

anon92994695 · December 12, 2019, 11:19am

It’s efforts like these, and say DataStructures.jl that I feel should just be included in Base Julia. Then again, I see why they aren’t, but now that I think about it, no I don’t. Awesome idea though!

Topic		Replies	Views
Performance of Dictionaries.jl vs Base.Dict Internals & Design package , performance , dictionary , dictionaries	32	3226	December 13, 2024
[ANN] Dictionaries.jl 0.3.0 - now using ordered collections by default Package Announcements dictionary	5	1372	June 13, 2021
Dictionaries.jl token API General Usage	5	1008	December 31, 2022
[ANN] DictionaryIndexing.jl Package Announcements dictionary	14	899	April 16, 2021
Poor time performance on Dict? Performance	26	19099	March 12, 2018

[ANN] Dictionaries.jl - Improved productivity and performance of dictionaries in Julia

Related topics