While trying to do a puzzle problem, I notice that two different Dict
s resulted in the same hash(...)
value, even though the Dict
s are not equivalent, as shown below.
julia> VERSION
v"0.5.0"
julia> a = Dict(Dict(3 => 4, 2 => 3) => 2, Dict(1 => 2, 5 => 6) => 1)
Dict{Dict{Int64,Int64},Int64} with 2 entries:
Dict(5=>6,1=>2) => 1
Dict(2=>3,3=>4) => 2
julia> b = Dict(Dict(1 => 2, 2 => 3, 5 => 6) => 1, Dict(3 => 4) => 2)
Dict{Dict{Int64,Int64},Int64} with 2 entries:
Dict(3=>4) => 2
Dict(2=>3,5=>6,1=>2) => 1
julia> hash(a)
0x2a9d36de3d0d9307
julia> hash(b)
0x2a9d36de3d0d9307
julia> isequal(a, b)
false
julia> c = Dict(Dict() => 1, Dict(3 => 4, 1 => 2, 5 => 6, 2 => 3) => 2)
Dict{Any,Int64} with 2 entries:
Dict{Any,Any}() => 1
Dict(2=>3,3=>4,5=>6,1=>2) => 2
julia> hash(c)
0x2a9d36de3d0d9307
The same thing happens in version 0.6.0-dev.1658
Is this expected behaviour or a bug?
It’s not really a bug.
There’s an issue here, because the algorithm for calculating a hash value for the abstract Associative
works by simply XORing the results of hashing the keys using the hash of the value as the seed, i.e. hash(key, hash(value))
, and that seems to end up frequently with collisions.
const hasha_seed = UInt === UInt64 ? 0x6d35bb51952d5539 : 0x952d5539
function hash(a::Associative, h::UInt)
h = hash(hasha_seed, h)
for (k,v) in a
h ⊻= hash(k, hash(v))
end
return h
end
I imagine that was done so that two dictionaries with the same contents, but different orders, would have the same hash value.
When you iterate through the (key,value)
pairs in a dictionary, the order is undefined, so the hash cannot depend on this, hence the xor. However, the hash should depend on which keys go with which values. It does for most keys, but apparently not for dictionary keys, and this seems like a bug in the algorithm.
I guess that depends on your definition of “bug”.
To me, a bug is when you get incorrect results, or there is an inconsistency between the documentation and the results (which could be a bug in the code, the documentation, or both). An inconsistency between behavior of different types/methods where a reasonable programmer might expect them to be consistent might indicate a bug, or some possibly undocumented design decision (in which case, it’s a documentation issue).
That’s not the case here (a good example of what I felt really was a bug with Dict
s was #15077, which was fixed by https://github.com/JuliaLang/julia/pull/15176).
Having these collide could potentially cause a performance issue, but the results would still be correct.
Note, that this problem / performance issue happens for all subtypes of Associative
, unless they have implemented their own hash
method (and haven’t copied the same flaw).
I already have a “fix” ready for this, in https://github.com/ScottPJones/julia/tree/spj/fixasshash if somebody could make it into a PR. It also has tests added.
Thanks very much for making a PR out of that!
There seems to be an unrelated failure (with a Pkg test) on AppVeyor - it had already gotten past the Dict unit tests.