Replacement for Dict, is anyone using isslotfilled?

Palli · October 22, 2020, 10:51am

Hi,

I’m looking into replacing Dict in Julia with an ordered implementation. I hear people would like that; any objections? Actually, I have already done it, and it works for me, but not for all the methods the old Dict provided.

You would think the OrderedDict in OrderedCollections.jl, the implementation I took, is a drop-in replacement, but it doesn’t have “isslotfilled”. That function isn’t exported, so my thinking is that people rarely use it, and it is only used once by Julia itself.

Have people used OrderedRobinDict from DataStructures.jl much? It DOES have isslotfilled, but I think it is less used, and thus less tested, since almost everyone who needs an ordered dict goes with the other one.

If I go with OrderedRobinDict, I would also need to add the unordered RobinDict it builds on (using Robin Hood hashing):

Base.@propagate_inbounds isslotfilled(h::RobinDict, index) = (h.hashes[index] != 0)
Base.@propagate_inbounds isslotempty(h::RobinDict, index) = (h.hashes[index] == 0)

In the original Dict, the corresponding method accesses slots, not hashes, so I cannot make type-pirating code work.

Tamas_Papp · October 22, 2020, 12:02pm

You may be interested in

Generally, if you want something changed in Base and you want concrete feedback, it is best to make a PR. From your post, it is unclear what it would improve.

If you have an idea for an improvement, it may be best to experiment in a package first.

I think that Base.isslotfilled is an internal implementation detail; no other code uses it other than a sampler in Random (which a PR would presumably change too). That said, you can’t just rewrite Dict, you need to change the specialized methods that implement various things on Dict.

Palli · October 22, 2020, 5:45pm

I should have been more clear, I already made a PR: https://github.com/JuliaLang/julia/pull/37804

isslotfilled was part of a problem I had, and I believe I fixed that and all the others. I just need to add my recent commits to the PR (the point is to get Dict ordered; I thought that part was clear) from my fork here:
https://github.com/JuliaLang/julia/compare/master...PallHaraldsson:master

I did try git push and got an error, so I tried again on my fork. It has some unrelated commits, and I’m just not sure how to pick only some of them and forward them to my PR.

rdeits · October 22, 2020, 5:52pm

To apply an individual commit to the branch from your PR, you can do:

git checkout <your branch name>
git cherry-pick <sha of the commit you want>
git push

Tamas_Papp · October 23, 2020, 8:22am

It is unclear why you want to make Base.Dict ordered, especially given that ordered dicts are available in various packages.

If you need an ordered dictionary, why not just use those? Or implement a new one if you think they lack features or could be done better.

GunnarFarneback · October 23, 2020, 9:18am

For me it would be a quality of life improvement if using an ordered dictionary didn’t require

using OrderedCollections
Adding OrderedCollections to Project.toml.
Adding a compat entry for OrderedCollections to Project.toml.
Writing OrderedDict instead of Dict everywhere it’s used. 11 characters instead of 4 do add up in terms of both writing and reading, and have a tendency to run into line length issues.

Naturally this has to be weighed against speed and memory consumption but on balance I have use for insertion order a lot more often than I use dictionaries for anything performance critical.

One use case that is a bit more than a quality of life consideration is Python interoperability. PyCall automatically converts between Python and Julia dictionaries, and since Python these days guarantees insertion order, it’s not ideal (*) that it’s lost in translation. It could be argued that PyCall should convert to OrderedDict, but I think that’s a bit of a tough sell.

(*) I spent hours yesterday hunting down a bug that turned out to be caused by this.

cstjean · October 23, 2020, 10:40am

I would love ordered dicts too. Unordered dicts caused me a lot of pain once because I accidentally iterated over them:

for (features, target) in training_set # a Dict
    train!(...)
end

Then the next Julia version changes the hash and there’s a mysterious change in test-set performance that must be tracked down because of course it could be some genuine bug.

As far as I’m concerned, Dict iteration is a bug waiting to happen. I’d be happier if it was an error by default and required an explicit for (k, v) in items(d) or sequential(d). But I might be in the minority.
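
A minimal sketch of one way to avoid depending on hash order today, assuming the keys are sortable: iterate over sorted keys instead of over the Dict itself. The data and the names `training_set` and `visited` are made up for illustration, and the `push!` stands in for the `train!` call above.

```julia
# Hypothetical training data keyed by feature name; the names are made up.
training_set = Dict("height" => 1.8, "age" => 42.0, "weight" => 75.0)

# Iterating `training_set` directly follows the internal hash layout,
# so sort the keys first to get an order that does not depend on it.
visited = Pair{String,Float64}[]
for features in sort!(collect(keys(training_set)))
    target = training_set[features]
    push!(visited, features => target)   # stand-in for train!(features, target)
end

@assert first.(visited) == ["age", "height", "weight"]
```

This gives key order rather than insertion order, so it is a workaround for reproducibility, not a replacement for an ordered dict.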

Palli · October 23, 2020, 11:21am

FYI: The new PR is at (and working despite “8 failing” [doc] checks): Change Dict to be ordered by default by PallHaraldsson · Pull Request #38145 · JuliaLang/julia · GitHub [I had git issues and was spending way too much time figuring them out, so I ended up with the nuclear option of starting again with git clone …]

About “unclear why you want”: Gunnar gave all of the arguments I’ve been making, plus:

About:

I think an ordered dict will necessarily use more memory, and be a bit slower for some operations IF dicts are big, while keeping the same time (and memory) complexity. I think most Dicts in actual use are small, however, and an OrderedDict can actually be made faster than any unordered Dict (LittleDict already is for small dicts, while it’s slower for big ones; you can have your cake and eat it too).
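
To make the memory trade-off concrete, here is a toy sketch of an insertion-ordered dictionary. `ToyOrderedDict` is a hypothetical name; this is not the layout used by OrderedCollections.jl or by the PR, it does not support deletion, and it only shows where the extra memory (a key vector next to the hash index) comes from while lookups stay O(1) on average.

```julia
# Toy insertion-ordered dictionary; `ToyOrderedDict` is a hypothetical name,
# not the layout used by OrderedCollections.jl or by the PR.
struct ToyOrderedDict{K,V}
    index::Dict{K,Int}   # key -> position in `ks` / `vs`
    ks::Vector{K}        # keys in insertion order (the extra memory)
    vs::Vector{V}        # values in insertion order
end

ToyOrderedDict{K,V}() where {K,V} = ToyOrderedDict{K,V}(Dict{K,Int}(), K[], V[])

function Base.setindex!(d::ToyOrderedDict, v, k)
    i = get(d.index, k, 0)
    if i == 0                    # new key: append and remember its position
        push!(d.ks, k)
        push!(d.vs, v)
        d.index[k] = length(d.ks)
    else                         # existing key: overwrite, keep its position
        d.vs[i] = v
    end
    return d
end

Base.getindex(d::ToyOrderedDict, k) = d.vs[d.index[k]]
Base.length(d::ToyOrderedDict) = length(d.ks)
Base.eltype(::Type{ToyOrderedDict{K,V}}) where {K,V} = Pair{K,V}
Base.iterate(d::ToyOrderedDict, i=1) =
    i > length(d.ks) ? nothing : (d.ks[i] => d.vs[i], i + 1)

d = ToyOrderedDict{String,Int}()
d["B"] = 2
d["A"] = 1
d["B"] = 20          # updating a key does not move it
@assert collect(d) == ["B" => 20, "A" => 1]
```

Iteration here is a plain scan over two vectors, which is one reason ordered designs can iterate quickly.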

What I get now:

julia> d = Dict("A" => 1, "B" => 2)
Base.OrderedDict{String, Int64} with 2 entries:
  "A" => 1
  "B" => 2

This works, but I should in the end at least get show to show: Base.OrderedDict → Dict.

Palli · October 23, 2020, 11:25am

That’s not something I’ve thought about, and likely will not change to have the highest probability of my PR accepted. This would be a breaking change? Or since you consider this a bug, not? This could be changed in Julia 2.0, or earlier? Either way, open an issue, make a PR (if not already in andyferris’ code)?

oxinabox · October 23, 2020, 12:58pm

github.com/JuliaLang/julia

make Dict ordered?

opened 03:07PM - 05 Jan 20 UTC

oxinabox

decision collections

@bkamins pointed out on slack:

```
julia> using Serialization;

julia> d = Dict{Symbol, Vector{Int}}(Symbol.('a':'z') .=> Ref([1]));

julia> serialize("test.bin", d);

julia> d2 = deserialize("test.bin");

julia> hcat(collect.(keys.([d, d2]))...)
26×2 Array{Symbol,2}:
 :o  :j
 :b  :x
 :p  :d
 :n  :k
 :j  :g
 :e  :u
 :c  :r
 :h  :a
 :l  :m
 :w  :y
 :x  :i
 :d  :o
 :k  :b
 :s  :p
 :v  :n
 :g  :e
 :u  :c
 :q  :h
 :r  :l
 :z  :w
 :a  :s
 :f  :v
 :m  :z
 :y  :q
 :i  :f
 :t  :t
```

But it doesn't have to be this way. If we redefine things so it remembers how many slots it should have, then it comes out the same as it came in.

```
hintsize(dict::AbstractDict) = length(dict)
hintsize(dict::Dict) = length(dict.keys)

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    sz = read(s.io, Int32)
    t = T();
    sizehint!(t, sz)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

function Serialization.serialize_dict_data(s::AbstractSerializer, d::AbstractDict)
    write(s.io, Int32(length(d)))
    write(s.io, Int32(length(d.slots)))
    for (k,v) in d
        serialize(s, k)
        serialize(s, v)
    end
end
```

But this is annoying because it changes the serialization format. I would rather change `sizehint!(::Dict)` or how we call it. The problem is that `sizehint!(Dict(), 26)` gives it 32 slots, but the `d` had 64 slots.

---

In python this was one of the things that really caught me out. Because python salts its hashes with a random salt selected each time it starts. But julia doesn't.

The issue above is about Base.Dict changing order across serialization.
This is not caused by Dict being unordered as such; it’s actually caused by the new Dict after the serialization round trip being made the right size, whereas the earlier Dict may have had collisions and done some bucketing things. You can ensure consistent order across serialization without ordering it, and that issue shows how.
But @jeff.bezanson suggested that the better way to fix it would be to make it ordered.
I am not sure I agree, but it certainly would fix it.

cstjean · October 23, 2020, 1:57pm

To be clear, I meant unordered Dict iteration is a bug waiting to happen. Your PR would fix that. My radical defensive programming suggestion for sequential(dict) is fanciful, there’s no way that would get accepted, and it’s OK

WschW · October 23, 2020, 2:03pm

How does making Dict ordered affect the equality of Dicts? If you make it ordered, it will be strange if Dicts with different orders are equal to one another. However, changing that equality would be a breaking change.

GunnarFarneback · October 23, 2020, 2:12pm

Strange or not, that’s how it works with OrderedDict from OrderedCollections and with dictionaries in Python.
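
For reference, a small check of how Base’s Dict compares today (this only shows the current unordered Dict; the OrderedDict behaviour is as stated above and is not exercised here, since that needs the package):

```julia
a = Dict("x" => 1, "y" => 2)
b = Dict("y" => 2, "x" => 1)   # same pairs, inserted in the opposite order

# Equality and hashing of a Dict depend only on its key-value pairs.
@assert a == b
@assert isequal(a, b)
@assert hash(a) == hash(b)
```

Keeping `==` order-insensitive for an ordered Dict would leave these assertions true, so existing comparisons would not change.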

Henrique_Becker · October 23, 2020, 2:19pm

Sorry, maybe I am understanding something wrong, but it does not seem to me that two Dicts with the same elements may be in different orders, the definition of order will not respect a single source of truth?

WschW · October 23, 2020, 2:33pm

This is about insertion ordering rather than the ordering of the value of the keys.

Palli · October 24, 2020, 12:27am

Yes, and no, I mean it already happened, and seems impossible to fix while keeping bug-compatibility.* I thought I was almost done with my PR and then realized the specific undefined order is depended upon, even by Julia itself, by the test-failures I was seeing.

E.g.

julia> using REPL.REPLCompletions

julia> REPLCompletions.latex_symbols
Dict{String, String} with 2500 entries:
  "\\sqrt"        => "√"
  "\\cbrt"        => "∛"
  "\\female"      => "♀"
[..]

julia> REPL.symbol_latex("√")
"\\surd"

Before, this returned “\sqrt”. Both are in a sense right, but I would have to fix the tests that rely on something specific, here that (or better yet, fix the dict-using code to return “\sqrt” again), and the broken tests seem endless, with 1000+ lines to go through.

If/since there’s a “bug” here relying on a specific “randomized” generated order (unlike the dict above with my “fix”, which has the same order as the listing in the file generating it: https://github.com/JuliaLang/julia/blob/b8df95fe05fabd821e26edf38008470b703fa36d/stdlib/REPL/src/latex_symbols.jl#L95), then such an order could also be relied on by users’ code:

github.com

JuliaLang/julia/blob/a1da84c3b0406e8378d8af781fe74ce749725d1f/stdlib/REPL/src/docview.jl#L329


      
              print(io, "Couldn't find ")
              printstyled(io, s, '\n', color=:cyan)
              print_correction(io, s)
          end
          repl_corrections(s) = repl_corrections(stdout, s)
          
          # inverse of latex_symbols Dict, lazily created as needed
          const symbols_latex = Dict{String,String}()
          function symbol_latex(s::String)
              if isempty(symbols_latex) && isassigned(Base.REPL_MODULE_REF)
                  for (k,v) in Iterators.flatten((REPLCompletions.latex_symbols,
                                                  REPLCompletions.emoji_symbols))
                      symbols_latex[v] = k
                  end
              end
              return get(symbols_latex, s, "")
          end
          function repl_latex(io::IO, s::String)
              # decompose NFC-normalized identifier to match tab-completion input
              s = normalize(s, :NFD)
              latex = symbol_latex(s)

Error in testset docs:
Test Failed at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\docs.jl:1002
  Expression: sprint(repl_latex, "√") == "\"√\" can be typed by \\sqrt<tab>\n\n"
   Evaluated: "\"√\" can be typed by \\surd<tab>\n\n" == "\"√\" can be typed by \\sqrt<tab>\n\n"

* @stevengj, I know how to “fix” that in a sense, keep using the old unordered dict for that specific thing (and it would even be brittle if hashing is changed later).
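
To illustrate why the result flips, here is a reduced sketch of inverting a many-to-one mapping. It uses a plain vector of pairs so the order is explicit, and it assumes “\sqrt” is listed before “\surd”, as the output above suggests. The names `listing`, `last_wins` and `first_wins` are made up; the `get!` variant is just one possible way to make the inverse independent of which duplicate comes last, not what the REPL code does.

```julia
# Two LaTeX names for the same symbol, in a fixed listing order.
# Assumption: "\\sqrt" is listed before "\\surd".
listing = ["\\sqrt" => "√", "\\surd" => "√"]

# Plain assignment, as in the loop in docview.jl: the last name seen wins.
last_wins = Dict{String,String}()
for (name, sym) in listing
    last_wins[sym] = name
end

# get! only stores a name if the symbol has none yet: the first name wins.
first_wins = Dict{String,String}()
for (name, sym) in listing
    get!(first_wins, sym, name)
end

@assert last_wins["√"] == "\\surd"
@assert first_wins["√"] == "\\sqrt"
```

With an unordered Dict as the source, which name ends up “last” depends on the hash layout, which is the fragility described above.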

I suspect more test failures are because the undefined order is relied upon, E.g.:

This one is also worrying (as is another one with “attempt to access 10-element Vector{Int64} at index [16]”):

Error During Test at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\loading.jl:586
  Got exception outside of a @test
  BoundsError: attempt to access 105-element Vector{String} at index [241]
  Stacktrace:
    [1] getindex
      @ .\array.jl:809 [inlined]
    [2] rand(rng::MersenneTwister, sp::Random.SamplerSimple{Dict{String, Any}, Random.SamplerSimple{LinearIndices{1, Tuple{Base.OneTo{Int64}}}, Random.SamplerRangeNDL{UInt64, Int64}, Int64}, Pair{String, Any}})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\generation.jl:403
    [3] rand!(rng::MersenneTwister, A::Vector{Pair{String, Any}}, sp::Random.SamplerSimple{Dict{String, Any}, Random.SamplerSimple{LinearIndices{1, Tuple{Base.OneTo{Int64}}}, Random.SamplerRangeNDL{UInt64, Int64}, Int64}, Pair{String, Any}})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:271
    [4] rand!
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:266 [inlined]
    [5] rand
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:279 [inlined]
    [6] rand(X::Dict{String, Any}, dims::Tuple{Int64})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:280
    [7] rand
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:283 [inlined]
    [8] macro expansion
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\loading.jl:587 [inlined]
    [9] macro expansion
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Test\src\Test.jl:1144 [inlined]
   [10] top-level scope
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\loading.jl:587
   [11] include
      @ .\Base.jl:393 [inlined]
   [12] macro expansion
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:24 [inlined]
   [13] macro expansion
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Test\src\Test.jl:1144 [inlined]
   [14] macro expansion
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:23 [inlined]
   [15] macro expansion
      @ .\timing.jl:343 [inlined]
   [16] runtests(name::String, path::String, isolate::Bool; seed::UInt128)
      @ Main C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:21
   [17] (::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:278
   [18] run_work_thunk(thunk::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:63
   [19] macro expansion
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:278 [inlined]
   [20] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed .\task.jl:395
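
Frame [2] in the trace is Random’s sampler for Dict, presumably the isslotfilled user mentioned earlier in the thread; I have not verified that this is the cause of the BoundsError. As a sketch of a layout-independent alternative, the hypothetical helper below (`rand_pair` is not part of Random) picks a pair using only `length` and iteration, so it does not care how slots, keys and values are stored. It is O(n) per sample, so it trades speed for independence from internals.

```julia
using Random

# Hypothetical helper, not part of Random: pick a uniformly random pair
# using only length and iteration, so it works for any AbstractDict.
function rand_pair(rng::AbstractRNG, d::AbstractDict)
    isempty(d) && throw(ArgumentError("dictionary must be non-empty"))
    n = rand(rng, 1:length(d))        # 1-based position of the chosen pair
    for (i, p) in enumerate(d)
        i == n && return p
    end
    error("unreachable: length and iteration disagree")
end

d = Dict("a" => 1, "b" => 2, "c" => 3)
p = rand_pair(MersenneTwister(1), d)
@assert d[p.first] == p.second
```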

Juan · October 24, 2020, 1:28am

Has anybody compared Dictionaries.jl with OrderedCollections.jl?
What’s the difference?
Which one is faster?

Palli · October 24, 2020, 9:02pm

Speed is not the only thing that matters; correctness does too, which is why I opened the issue:

github.com/JuliaLang/julia

Add ODict to Base for OrderedDict

opened 04:05PM - 24 Oct 20 UTC

closed 10:59AM - 26 Oct 20 UTC

PallHaraldsson

"_unordered_ `Dict` iteration is a bug waiting to happen. Your PR would fix that…." My answer: https://discourse.julialang.org/t/replacement-for-dict-is-anyone-using-isslotfilled/48810/16?u=palli

I started that thread, asking if people would like Dict changed to ordered, but I realize by now that it's probably better to add the non-default ODict, so people can choose that, easily. At some point, e.g. Julia 2.0 we could change Dict to refer to ODict.

I neglected to mention the PR #38145, and one of the answers were: "Generally, if you want something changed in Base and you want concrete feedback, it is best to make a PR" elsewhere, people called for discussion. My PR already adds OrderedDict, and it works, it's just test failing because I'm replacing Dict, and with rather adding it as ODict, all the problems would go away.

@andyferris, I DO have some concerns, I thought I could choose any of the implementations, of ordered (not sorted) dicts available, but there's also this one I haven't looked at too closely (would it be better/ready?): https://github.com/andyferris/Dictionaries.jl/issues/31#issuecomment-700461649

>Midway through that I realised ordered dictionaries were a massive usability improvement, so that got incorporated here also.
>
>The concrete implementation of Base.Dict is almost orthogonal in a sense - so long as we can merge it in a non-breaking way, it could enter Base at a different time (and Julia 1.6 makes a lot of sense to me, also).

Python added ordered dict to its standard library (despite "existing Python ordered-dict implementations") in 2008, with in rationale "PHP and Ruby 1.9 guarantee a certain order on iteration": https://www.python.org/dev/peps/pep-0372/

>Code ported from other programming languages such as PHP often depends on an ordered dict. Having an implementation of an ordering-preserving dictionary in the standard library could ease the transition and improve the compatibility of different libraries. [..]
>
>Comparing two ordered dictionaries implies that the test will be order-sensitive [..] When ordered dicts are compared with other Mappings, their order insensitive comparison is used. This allows ordered dictionaries to be substituted anywhere regular dictionaries are used. [..] Keeping a sorted list of keys is fast for all operations except __delitem__() which becomes an O(n) exercise. This data structure leads to very simple code and little wasted space.

Then for Python 3.7, in 2017-2018 the default dict became ordered by default: https://mail.python.org/pipermail/python-dev/2017-December/151283.html

> * 50% less memory usage
> * 15% faster creation
> * 100% (2x) faster iteration
> * 20% slower move_to_end
> * 40% slower comparison

with changed implementation: Remove doubly-linked list from C OrderedDict https://bugs.python.org/issue31265#msg301942

and LittleDict in OrderedCollections.jl claims to be fastest, for small dicts, faster than unordered, e.g. the default Dict.

About Dictionaries.jl it claims:

The three main difference to Dict are that it preserves the order of elements, it iterates much faster, and it iterates values rather than key-value pairs.

No dict is going to be fastest for everything. That said, these are newer than the one in Julia’s standard library:

RobinDict (implemented with Robin Hood Hashing)

SwissDict (inspired from SwissTables)

I assume the latter, as it is newer, is the faster of ~~those unordered~~ [EDIT: I assumed SwissDict is unordered, but I’m no longer sure; it seems to be ordered]. It’s not Google’s code, but is a reimplementation of:

we were rolling out across Google’s codebase. […]
The “flat” Swiss tables should be your default choice.

I’m still not sure ~~that~~ or any unordered should be most people’s go-to dict, as the speed difference shouldn’t be that big, and ordered can be convenient, and prevents people from shooting themselves in the foot.

EDIT: If it’s the fastest, and ordered, then it seems like a no-brainer to use it. [EDIT: I see SwissDict uses assembly, so it currently ties you to 64-bit x86, I think. That’s one argument against it if you want portability, but it also gives you speed.]

Palli · October 26, 2020, 12:19pm

I did make a PR for ODict to Julia so that you wouldn’t have to do the above:

github.com/JuliaLang/julia

Ordered dict based on SwissDict

JuliaLang:master ← PallHaraldsson:ordered_dict_based_on_swiss_dict

opened 11:44PM - 24 Oct 20 UTC

PallHaraldsson

+1825 -3

Fixes #38163 (not 38135 that got here by accident...). See this interesting talk, why I chose this implementation: https://www.youtube.com/watch?v=ncHmEUmJZf4&t=3s

>This talk describes the process of design and optimization that starts with std::unordered_map and ends with a new design we call "SwissTable", a 2-level N-way associative hash table. Our implementation of this new design gets 2-3x better performance with significant memory reductions (compared to unordered_map) and is being broadly deployed across Google.

https://abseil.io/blog/20180927-swisstables [EDIT: Attribution: Code is modified code from @eulerkochy's taken from JuliaCollections/DataStructures.jl.]

it was closed.

Tamas_Papp · October 26, 2020, 12:43pm

In case you missed it, the general tendency is to move things out of Base (and to a certain extent, the standard libraries).

The opposite happens, but requires very compelling arguments; simply the fact that someone has to be using a package is usually not sufficient.
