Replacement for Dict, is anyone using isslotfilled?

Hi,

I’m looking into replacing Dict in Julia with an ordered implementation. I hear people would like that, any objections? Actually I already did that, and it works for me, but not for all methods older Dict provided.

You would think the OrderedDict in OrderedCollection.jl, the implementation I took, is a drop in replacement, but it doesn’t have “isslotfilled”, which isn’t exported, so my thinking is it might be rare for people to use it. And that method is only used once by Julia itself.

Have people used OrderedRobinDict much from DataStructures.jl? It DOES have it, but I think it might be less used, and thus less tested, with almost all needing ordered going with the other.

If I go with OrderedRobinDict, I would also need to add the unordered RobinDict it builds on (using Robin Hood hashing):

Base.@propagate_inbounds isslotfilled(h::RobinDict, index) = (h.hashes[index] != 0)
Base.@propagate_inbounds isslotempty(h::RobinDict, index) = (h.hashes[index] == 0)

In the original Dict, the similar method accesses slots, not hashes, so I can not make type pirating code work.

You may be interested in

Generally, if you want something changed in Base and you want concrete feedback, it is best to make a PR. From your post, it is unclear what it would improve.

If you have an idea for an improvement, it may be best to experiment in a package first.

I think that Base.isslotfilled is an internal implementation detail — no other code uses it other than a sampler in Random (which a PR would presumably change too). That said, you can’t just rewrite Dict, you need to change specialized the methods that implement various things on Dict.

2 Likes

I should have been more clear, I already made a PR: https://github.com/JuliaLang/julia/pull/37804

The isslotted was part of a problem I had, and I believe I fixed that and all others, just need to add my recent commits to the PR (the point is to get Dict ordered, I thought that part was clear) from my fork here:
https://github.com/JuliaLang/julia/compare/master...PallHaraldsson:master

I did try git push, and got an error as so tried again to my fork. It has some unrelated commits, and I’m just not sure how to pick only some, and forward to my PR.

To apply an individual commit to the branch from your PR, you can do:

git checkout <your branch name>
git cherry-pick <sha of the commit you want>
git push

It is unclear why you want to make Base.Dict ordered, especially given that ordered dicts are available various packages.

If you need an ordered dictionary, why not just use those? Or implement a new one if you think they lack features or could be done better.

3 Likes

For me it would be a quality of life improvement if using an ordered dictionary didn’t require

  • using OrderedCollections
  • Adding OrderedCollections to Project.toml.
  • Adding a compat entry for OrderedCollections to Project.toml.
  • Writing OrderedDict instead of Dict everywhere it’s used. 11 characters instead of 4 do add up in terms of both writing and reading, and has a tendency to run into line length issues.

Naturally this has to be weighed against speed and memory consumption but on balance I have use for insertion order a lot more often than I use dictionaries for anything performance critical.

One use case that is a bit more than a quality of life consideration is the Python interoperability. PyCall automatically converts between Python and Julia dictionaries and since Python these days guarantee insertion order it’s not ideal (*) that it’s lost in translation. It could be argued that PyCall should convert to OrderedDict but I think that’s a bit of a tough sell.

(*) I spent hours yesterday hunting down a bug that turned out to be caused by this.

3 Likes

I would love ordered dicts too. Unordered dicts caused me a lot of pain once because I accidentally iterated over them:

for (features, target) in training_set # a Dict
    train!(...)
end

Then the next Julia version changes the hash and there’s a mysterious change in test-set performance that must be tracked down because of course it could be some genuine bug.

As far as I’m concerned, Dict iteration is a bug waiting to happen. I’d be happier if it was an error by default and required an explicit for (k, v) in items(d) or sequential(d). But I might be in the minority.

1 Like

FYI: The new PR is at (and working despite “8 failing” [doc] checks): Change Dict to be ordered by default by PallHaraldsson · Pull Request #38145 · JuliaLang/julia · GitHub [I had git issues, and was spending way too much time figuring out, so ended up with the nuclear option of starting again with gt clone …]

About “unclear why you want” Gunnar, had all of the arguments for, I’ve been having plus:

About:

I think ordered will necessarily have more memory use, and be a bit slower, for some operations, IF dicts are big, while keeping same time (and memory) complexity. I think most Dicts in actual use are however small, and OrderedDict can actually be made faster than any unordered Dict (LittleDict already is, while it’s slower for big ones, you can have your cake and eat it too).

What I get now:

julia> d = Dict("A" => 1, "B" => 2)
Base.OrderedDict{String, Int64} with 2 entries:
  "A" => 1
  "B" => 2

This works, but I should in the end at least get show to show: Base.OrderedDict → Dict.

That’s not something I’ve thought about, and likely will not change to have the highest probability of my PR accepted. This would be a breaking change? Or since you consider this a bug, not? This could be changed in Julia 2.0, or earlier? Either way, open an issue, make a PR (if not already in andyferris’ code)?

is about an issue with Base.Dict changing order across serialization.
This is not caused by Dict being unordered as such, its actually caused by the new Dict post serialization being made the right size, where as the earlier DIct may have had colisions and done some bucketting things. You can ensure consistent order across serialization without ordering it and that issue shows how.
But @jeff.bezanson suggested that the better way to fix it would be to make it ordered.
I am not sure I agree, but it certainly would fix it.

2 Likes

To be clear, I meant unordered Dict iteration is a bug waiting to happen. Your PR would fix that. My radical defensive programming suggestion for sequential(dict) is fanciful, there’s no way that would get accepted, and it’s OK :slightly_smiling_face:

2 Likes

How does making dict ordered effect the equality of Dict. If you make it ordered it will be strange if Dicts with different orders are equal to one another. However changing that equality would be a breaking change.

Strange or not, that’s how it works with OrderedDict from OrderedCollections and with dictionaries in Python.

1 Like

Sorry, maybe I am understanding something wrong, but it does not seem to me that two Dicts with the same elements may be in different orders, the definition of order will not respect a single source of truth?

This is about insertion ordering rather then the ordering of the value of the keys.

Yes, and no, I mean it already happened, and seems impossible to fix while keeping bug-compatibility.* I thought I was almost done with my PR and then realized the specific undefined order is depended upon, even by Julia itself, by the test-failures I was seeing.

E.g.

julia> using REPL.REPLCompletions

julia> REPLCompletions.latex_symbols
Dict{String, String} with 2500 entries:
  "\\sqrt"        => "√"
  "\\cbrt"        => "∛"
  "\\female"      => "♀"
[..]

julia> REPL.symbol_latex("√")
"\\surd"

before this returned “\sqrt”, but while both are in a sense right, I would have to fix test errors that rely on something specific, here that (or better yet fix the dict-using code to get “\sqrt” again), and the broken tests seem endless, 1000+ lines to go through.

If/since there’s a “bug” here relying on specific “randomized” generated order (unlike the dict above with my “fix”, having same order as the listing from the file generating it: https://github.com/JuliaLang/julia/blob/b8df95fe05fabd821e26edf38008470b703fa36d/stdlib/REPL/src/latex_symbols.jl#L95), then such could be also be relied on by users code:

Error in testset docs:
Test Failed at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\docs.jl:1002
  Expression: sprint(repl_latex, "√") == "\"√\" can be typed by \\sqrt<tab>\n\n"
   Evaluated: "\"√\" can be typed by \\surd<tab>\n\n" == "\"√\" can be typed by \\sqrt<tab>\n\n"

* @stevengj, I know how to “fix” that in a sense, keep using the old unordered dict for that specific thing (and it would even be brittle if hashing is changed later).

I suspect more test failures are because the undefined order is relied upon, E.g.:

This one also worrying (and other one with “attempt to access 10-element Vector{Int64} at index [16]”):

Error During Test at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\loading.jl:586
  Got exception outside of a @test
  BoundsError: attempt to access 105-element Vector{String} at index [241]
  Stacktrace:
    [1] getindex
      @ .\array.jl:809 [inlined]
    [2] rand(rng::MersenneTwister, sp::Random.SamplerSimple{Dict{String, Any}, Random.SamplerSimple{LinearIndices{1, Tuple{Base.OneTo{Int64}}}, Random.SamplerRangeNDL{UInt64, Int64}, Int64}, Pair{String, Any}})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\generation.jl:403
    [3] rand!(rng::MersenneTwister, A::Vector{Pair{String, Any}}, sp::Random.SamplerSimple{Dict{String, Any}, Random.SamplerSimple{LinearIndices{1, Tuple{Base.OneTo{Int64}}}, Random.SamplerRangeNDL{UInt64, Int64}, Int64}, Pair{String, Any}})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:271
    [4] rand!
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:266 [inlined]
    [5] rand
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:279 [inlined]
    [6] rand(X::Dict{String, Any}, dims::Tuple{Int64})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:280
    [7] rand
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:283 [inlined]
    [8] macro expansion
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\loading.jl:587 [inlined]
    [9] macro expansion
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Test\src\Test.jl:1144 [inlined]
   [10] top-level scope
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\loading.jl:587
   [11] include
      @ .\Base.jl:393 [inlined]
   [12] macro expansion
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:24 [inlined]
   [13] macro expansion
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Test\src\Test.jl:1144 [inlined]
   [14] macro expansion
      @ C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:23 [inlined]
   [15] macro expansion
      @ .\timing.jl:343 [inlined]
   [16] runtests(name::String, path::String, isolate::Bool; seed::UInt128)
      @ Main C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:21
   [17] (::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:278
   [18] run_work_thunk(thunk::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:63
   [19] macro expansion
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:278 [inlined]
   [20] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed .\task.jl:395

Has anybody compared Dictioanries.jl with OrderedCollections.jl?
What’s the difference?
Which one is faster?

Speed is not the only matter, correctness is too, why I opened the issue:

and LittleDict in OrderedCollections.jl claims to be fastest, for small dicts, faster than unordered, e.g. the default Dict.

About Dictionaries.jl it claims:

The three main difference to Dict are that it preserves the order of elements, it iterates much faster, and it iterates values rather than key-value pairs.

No dict is going to be fastest for everything, that said, newer than in Julia standard library:

I assume the latter as it is newer is the faster of those unordered [EDIT: I assumed SwissDict is unordered, but not longer sure, it seems to be ordered]. It’s not Google’s code, but is a reimplementation of:

we were rolling out across Google’s codebase. […]
The “flat” Swiss tables should be your default choice.

I’m still not sure that or any unordered should be most people’s go-to dict, as the speed difference shouldn’t be that big, and ordered can be convenient, and prevents people from shooting themselves in the foot.

EDIT: If it’s the fastest, and ordered, then it seems like a no-brainer to use it. [EDIT: I see SwishDict, uses assembly, so currently tying you do 64-bit x86 I think, so that’s one argument against it if you want portability, but it also gives you speed.]

1 Like

I did make a PR for ODict to Julia so that you wouldn’t have do the above:

it was closed.

In case you missed it, the general tendency is to move things out of Base (and to a certain extent, the standard libraries).

The opposite happens, but requires very compelling arguments — simply the fact that some has to be using a package is usually not sufficient.

3 Likes