[ANN] Dictionaries.jl - Improved productivity and performance of dictionaries in Julia

It broadcasts (maps) over the values but preserves the keys. Like this:

julia> d = HashDictionary(["a", "b", "c"], [1, 2, 3])
3-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1

julia> d .+ 1
3-element HashDictionary{String,Int64}
 "c" │ 4
 "b" │ 3
 "a" │ 2
4 Likes

afaiu it creates a new dictionary that has the same hashmap as the old one, just incremented values
EDIT sorry crosspost, but deleting now is annoying in discourse so just ignore

1 Like

missed it while I skimmed it

Cool. There’s a tension right now between compile cruft and performance (DF vs indexed tables). Do you see this improving on that tradeoff?

How far does the performance go? Can this interface generalize arrays and serve as a named dim array if wrapped in an abstract array struct?

1 Like

I have exactly 1.3 questions to ask.

Is this Dictionary Thread Safe?

4 Likes

Thread safe in what way? If you’re looking for a dictionary which is thread safe for parallel insertions and deletions you won’t find it in Dictionaries.jl for the same reason it’s not in Base — the overhead would be too high for single threaded use (see also Can dicts be threadsafe?)

6 Likes

Is this Dictionary Thread Safe?

Good question!

The HashDictionary is implemented the same as (i.e. the code is copied from) Base.Dict, but without the allowances for WeakKeyDict. Neither mutating its keys nor reading from it, while some other thread mutates its keys, would be safe at all. (I.e. use the rust rules when sharing it).

It makes me wonder… the AbstractDictionary interface is pretty generic, but the token stuff doesn’t necessarily support atomic key-and-value updates - this probably bears some further thought! There’s nothing stopping you creating a non-tokenized dictionary with appropriate safeguards though.

4 Likes

How far does the performance go? Can this interface generalize arrays and serve as a named dim array if wrapped in an abstract array struct?

I think so? Thinking about multidimensional dictionaries is something I’d like to do in the future. The interface should be pretty low overhead and ammenable to compiler optimizations/zero-cost abstractions, like AbstractArray. It’s possible that inlining immutable references to the heap will help a little bit (https://github.com/JuliaLang/julia/pull/33886).

Cool. There’s a tension right now between compile cruft and performance (DF vs indexed tables). Do you see this improving on that tradeoff?

So this is about say “strongly typed” tables vs “weakly typed” dataframes? This probably doesn’t contribute a whole lot to that… just like AbstractArray, AbstractDictionary has a type parameter for a uniform eltype. That said, I suspect you could create a “static dictionary” with inference working out fine, which would be an interesting experiment (possibly just need to unroll the loop for getindex on Dictionary containing tuples, and hope constant propagation works for the keys tuple… unfortunately I’m not sure what “guarantees” the compiler provides for constant propagation).

1 Like

Really nice to see this come to life!

4 Likes

Not sure if you’ve seen the package DataCubes.jlwhich provides DictArray, a multidimensional array whose element type is an ordered dictionary with common keys, and LabeledArray, a multidimensional array consisting of base and axes?
It hasn’t been maintained for a couple of years, looks as if last update was for Julia v0.6. It is MIT Licensed, and perhaps will give some ideas for when you’re ready to move in that direction?


also has nice documentation:
http://c-s.github.io/DataCubes.jl/introduction

Thanks for your efforts with this package!

3 Likes

Thanks, this is a great work on re-designing a coherent interface of dictionary!

I find the design choice that dict[key] = value means strict update interesting… in the sense I am so used to it to mean “upsert” it’s a bit hard to imagine how reading/writing code would be like. But I think I get the analogy to arrays and consistency with other parts of the interface.

In Dictionaries with the same indices, you are mentioning that

The similar function is used to create a dictionary with defined indices, but undefined values that can be set/mutated after the fact. similar(dict, T) creates a container with the same indices as dict and, optionally, a new element type.

I suppose this is intentionally like this also for broadcasting and mapping?:

julia> d1 = HashDictionary(["a", "b", "c"], [1, 2, 3])
3-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1

julia> d2 = d1 .+ 1
3-element HashDictionary{String,Int64}
 "c" │ 4
 "b" │ 3
 "a" │ 2

julia> keys(d1) === keys(d2)
true

However, it feels undesirable property to me that insert!ing one of the dictionary mutates other ones that share the indices:

julia> insert!(d2, "d", 5)
4-element HashDictionary{String,Int64}
 "c" │ 4
 "b" │ 3
 "a" │ 2
 "d" │ 5

julia> d1
4-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1
 "d" │ 0

Is there a plan to “fix” this (or is this expected)? Maybe using copy-on-write indices? Immutable/persistent data structure?

4 Likes

I find the design choice that dict[key] = value means strict update interesting… in the sense I am so used to it to mean “upsert” it’s a bit hard to imagine how reading/writing code would be like. But I think I get the analogy to arrays and consistency with other parts of the interface.

Yes this one is quite interesting. I fretted over this one for a while, but I did use Dictionaries.jl for a little while for some “real” work before releasing, and honestly I never missed this. On the other hand, the feeling of certainty that the keys aren’t shifting under you while you are mutating values is quite comforting and very much seemed worth it to me (and set! is a nice partner to get!).

I suppose this is intentionally like this also for broadcasting and mapping?:

Yeah, generic functions delegate to similar and empty as factories to materialize new containers, so anything that preserves keys ends up going through similar. This is quite nice because I found I frequently ended up using fast co-iteration with the input and output, and also I rarely (if ever?) needed to mutate the keys of the output in practice.

Is there a plan to “fix” this (or is this expected)? Maybe using copy-on-write indices? Immutable/persistent data structure?

Yes, they are both valid options. My personal favorite is freeze from #31630 - I’d possibly have a freezekeys or freezeindices thing that you call first in order to get what is the current behavior, and revert similar on mutable keys to making a copy. We should fix this, for sure - but I feel this is the exact same problem with SubArray (mutating the indexer part can cause segfaults, since the bounds check on the indexer values are done at construction time, and the same goes for resizing the indexee) which we should fix first (and using freeze on the indexer and freezekeys on the indexee is only the way to achieve that which I have thought of, while I’m hoping the compiler can realize copying is a not needed in the majority of cases via escape analysis/rust-style borrow checker) . Currently as it stands, consider Dictionaries.jl to be a sharp tool.

3 Likes

Does it mean there would be (say) MutableIndices and FrozenIndices types? Reading your comments in #31630, I now can see that it’d work quite nicely with dict[key] = value being strict update; i.e., only dictionaries with MutableIndices support insert! while both types of dictionary support dict[key] = value.

1 Like

We could make our own types, yes. Or else wait for frozen arrays and freeze the arrays underlying the hashmap. I’m looking for feedback on this one, actually, so opinions welcome.

1 Like

Skimming #31630, I thought you can just go ahead and define custom mutable and immutable indices types for now and then swap its internal based on Base.freeze/Base.melt, once #31630 is merged. (And also maybe add unsafe_freeze API to do “manual borrow checker” to avoid copying until #31630 arrives?) This, of course, relies on that the design of #31630 does not deviate much from the current version. But do you see pitfalls other than that?

1 Like

A while ago I started looking at improving indexing for some personal projects. The thing I got stuck on was the need to have distinct static and mutable ranges, which is why I developed https://github.com/Tokazama/StaticRanges.jl. I’ve also been trying to get the filtering/find/search stuff optimized for it too. This would only help with an index that can be conceptualized as a range, but I think that’s the most common use case (at least for me).

If you’re interested in using it I’d love to take a look at integrating it with Dictionaries.jl. It’s not fully optimized yet but I’ve included nearly all the tests on ranges in base and added many others to test some of the novel functionalities.

3 Likes

It’s efforts like these, and say DataStructures.jl that I feel should just be included in Base Julia. Then again, I see why they aren’t, but now that I think about it, no I don’t. Awesome idea though!

5 Likes

I’ve tried to limit the package to stuff that I feel could conceivably put in base someday. I’m sure someone smarter than me will take some of it and make a far better version before that ever happens though.

1 Like

Yes I suppose we would statically encode the “insertability” of the indices at the type level, somehow. At the abstract level I’d rather use traits over abstract subtypes of AbstractIndices, though.

I thought you can just go ahead and define custom mutable and immutable indices types for now and then swap its internal based on Base.freeze / Base.melt , once #31630 is merged. (And also maybe add unsafe_freeze API to do “manual borrow checker” to avoid copying until #31630 arrives?) This, of course, relies on that the design of #31630 does not deviate much from the current version. But do you see pitfalls other than that?

Yeah, I’m already prototyping so much that I was afraid of getting too far ahead would start to make things over-ambitious, but yes we can do that. Ideally we’d overload functions that are defined prior to Dictionaries.jl though (really… we’ve needed to expand our traits for arrays and containers in general for a while now, so some shared package where we deal with these “container” ideas may be ideal).

2 Likes

Yeah, it would be nice to factor out mutability traits/implementations as a different package, so that Dictionaries.jl does not have to be responsible for all the advanced prototypings. IIUC DataFrames.jl also need something similar for names(df) and df[!, col]. But, previous discussion (and the resulting package ReadOnlyArrays.jl) did not explore mutability of indices/shape, IIRC. ArrayInterface.jl has similar traits (e.g., ismutable) but it focuses on the mutability of values and not indices. Also, it focuses on arrays and not containers in general. Anyway, I guess you are already aware of many of these. I’m just re-emphasising that there are immediate needs for this in Julia ecosystem.

Looking at DataFrames, it looks like it still is not using ReadOnlyArrays. I wonder if there is still a plan to use this. As it needs two kinds of immutabilities (immutable values and immutable indices/shape), it looks like the best place for this to happen.

1 Like