Different Dicts hashing to the same value

question

#1

While trying to do a puzzle problem, I notice that two different Dicts resulted in the same hash(...) value, even though the Dicts are not equivalent, as shown below.

julia> VERSION
v"0.5.0"

julia> a = Dict(Dict(3 => 4, 2 => 3) => 2, Dict(1 => 2, 5 => 6) => 1)
Dict{Dict{Int64,Int64},Int64} with 2 entries:
  Dict(5=>6,1=>2) => 1
  Dict(2=>3,3=>4) => 2

julia> b = Dict(Dict(1 => 2, 2 => 3, 5 => 6) => 1, Dict(3 => 4) => 2)
Dict{Dict{Int64,Int64},Int64} with 2 entries:
  Dict(3=>4)           => 2
  Dict(2=>3,5=>6,1=>2) => 1

julia> hash(a)
0x2a9d36de3d0d9307

julia> hash(b)
0x2a9d36de3d0d9307

julia> isequal(a, b)
false

julia> c = Dict(Dict() => 1, Dict(3 => 4, 1 => 2, 5 => 6, 2 => 3) => 2)
Dict{Any,Int64} with 2 entries:
  Dict{Any,Any}()           => 1
  Dict(2=>3,3=>4,5=>6,1=>2) => 2

julia> hash(c)
0x2a9d36de3d0d9307

The same thing happens in version 0.6.0-dev.1658

Is this expected behaviour or a bug?


#2

I think this is a bug; I filed it as https://github.com/JuliaLang/julia/issues/19995


#3

It’s not really a bug.
There’s an issue here, because the algorithm for calculating a hash value for the abstract Associative works by simply XORing the results of hashing the keys using the hash of the value as the seed, i.e. hash(key, hash(value)), and that seems to end up frequently with collisions.

const hasha_seed = UInt === UInt64 ? 0x6d35bb51952d5539 : 0x952d5539
function hash(a::Associative, h::UInt)
    h = hash(hasha_seed, h)
    for (k,v) in a
        h ⊻= hash(k, hash(v))
    end
    return h
end

I imagine that was done so that two dictionaries with the same contents, but different orders, would have the same hash value.


#4

When you iterate through the (key,value) pairs in a dictionary, the order is undefined, so the hash cannot depend on this, hence the xor. However, the hash should depend on which keys go with which values. It does for most keys, but apparently not for dictionary keys, and this seems like a bug in the algorithm.


#5

I guess that depends on your definition of “bug”.

To me, a bug is when you get incorrect results, or there is an inconsistency between the documentation and the results (which could be a bug in the code, the documentation, or both). An inconsistency between behavior of different types/methods where a reasonable programmer might expect them to be consistent might indicate a bug, or some possibly undocumented design decision (in which case, it’s a documentation issue).

That’s not the case here (a good example of what I felt really was a bug with Dicts was #15077, which was fixed by https://github.com/JuliaLang/julia/pull/15176).

Having these collide could potentially cause a performance issue, but the results would still be correct.

Note, that this problem / performance issue happens for all subtypes of Associative, unless they have implemented their own hash method (and haven’t copied the same flaw).

I already have a “fix” ready for this, in https://github.com/ScottPJones/julia/tree/spj/fixasshash if somebody could make it into a PR. It also has tests added.


#6

Thanks very much for making a PR out of that! :grinning:


#7

There seems to be an unrelated failure (with a Pkg test) on AppVeyor - it had already gotten past the Dict unit tests.