Inconsistent hashing of types containing `Set` or `Array` fields


struct TestType
    S::Set{Int}
end

t1 = TestType(Set([1,2]))
t2 = TestType(Set([1,2]))

Then I do

julia> hash(t1)

julia> hash(t2)

and get two different values. Well, that’s kind of annoying. At this point I’m starting to imagine reasons why this might be the case, but then I do

julia> hash(t1.S)

julia> hash(t2.S)

and those two values are identical.

I’ve now discovered that this also happens with types containing Array fields (only for custom types, of course; I didn’t notice until I created a few and tested them). Since that’s a pretty important use case, I’m imagining there is a good reason for this behavior. Any idea what that might be?
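For reference, the symptoms above can be reproduced in a few lines. The struct body below is a guess at the original definition (the post only shows that it has a field named S holding a Set):

```julia
# Guessed reconstruction of the struct from the original post:
# an immutable struct wrapping a Set in a single field S.
struct TestType
    S::Set{Int}
end

t1 = TestType(Set([1, 2]))
t2 = TestType(Set([1, 2]))

# The wrapper structs hash differently (identity-based fallback)...
println(hash(t1) == hash(t2))     # false

# ...even though the Set fields hash identically (content-based).
println(hash(t1.S) == hash(t2.S)) # true
```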

The fallback method for hash() just uses object_id. Try:

@edit hash(t1, UInt(0))

Even though t1 and t2 have the same contents, they are different non-bits structs so they have different identities. You might not see this effect with a bitstype struct because identical bitstype structs seem to have the same object_id():

julia> struct BitsType
           x::Int
       end

julia> b = BitsType(1)
BitsType(1)

julia> @edit hash(b, UInt(0))

julia> Base.object_id(b)

julia> Base.object_id(BitsType(1))

(both object_id calls print the same value)
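The same check as a plain script, using objectid (the current spelling; object_id is the pre-0.7 name used above) and a hypothetical one-field Int struct:

```julia
# Hypothetical one-field bits type, mirroring the REPL session above.
struct BitsType
    x::Int
end

b = BitsType(1)

# isbits immutables with equal contents are egal (===), so they share
# an objectid, and therefore the identity-based fallback hash agrees.
println(BitsType(1) === b)                    # true
println(objectid(BitsType(1)) == objectid(b)) # true
println(hash(BitsType(1)) == hash(b))         # true
```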

At a slightly higher level:

Your custom object could care more about behavior than contents. In your example, t1 and t2 are completely independent and can, in the future, diverge. This means that they are programmatically distinguishable, so we take the conservative route and say that they’re neither == nor hash-equal.

If you want them to behave as their contents do, then you need to decide which fields are important in their equality. It may be that not all fields are the same.

julia> struct TestType
           S::Set{Int}
       end

julia> t1 = TestType(Set([1,2]))
TestType(Set([2, 1]))

julia> t2 = TestType(Set([1,2]))
TestType(Set([2, 1]))

julia> t1 == t2
false

julia> push!(t1.S, 3)
Set([2, 3, 1])

julia> t1, t2
(TestType(Set([2, 3, 1])), TestType(Set([2, 1])))
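If content-based behavior is what you want, a minimal sketch (assuming the single field S is all that matters for equality) is to define == and hash as a pair:

```julia
struct TestType
    S::Set{Int}
end

# Content-based equality: two TestTypes are == when their S fields are.
Base.:(==)(a::TestType, b::TestType) = a.S == b.S

# hash must be consistent with ==: mix the field's (content-based) hash
# into the seed. The :TestType salt keeps a wrapper from hashing
# identically to the bare Set it wraps.
Base.hash(t::TestType, h::UInt) = hash(t.S, hash(:TestType, h))

t1 = TestType(Set([1, 2]))
t2 = TestType(Set([1, 2]))
println(t1 == t2)             # true
println(hash(t1) == hash(t2)) # true
```

Defining the two together matters: Dict and Set assume that x == y implies hash(x) == hash(y).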

I guess my concept of == needs adjusting. I was thinking of it as being like mathematical equality wherever possible. What you guys are talking about sounds more like === to me. Just because two objects may diverge at some future time t > t₀ doesn’t mean that they are not equal at t₀, so I’d expect == to reflect this.
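For what it’s worth, Base containers like Set do opt in to content-based ==, which is exactly why the struct default feels surprising:

```julia
a = Set([1, 2])
b = Set([1, 2])

# Set defines content-based equality...
println(a == b)  # true

# ...but the two objects are still distinct by identity:
println(a === b) # false

# The default struct == falls back to ===, which is why a wrapper
# around two such Sets compares unequal out of the box.
```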

I understand that performance could become an issue when hashing large data, but it seems to me that most users would be conscious of this. So again, I was expecting default hashing to respect the naive concept of equality, with users replacing the hash function themselves if it becomes too computationally intensive.

Welcome to [link to the long-standing GitHub issue on this]
Oh god, an open issue from 2013 :fearful:. I can see that I’ve walked into a minefield :laughing:.