`unique(Vector{CustomStruct})` requires method for hash | Bug or misleading documentation?

I was surprised to see that it is not enough to implement isequal to get unique for free, though the documentation seems to suggest it:

https://docs.julialang.org/en/v1/base/collections/#Base.unique

Return an array containing only the unique elements of collection itr , as determined by isequal

The problem seems to be that unique implicitly assumes that struct1 == struct2 implies hash(struct1) == hash(struct2).

I noticed this behaviour for a struct with two fields where the order does not matter for comparison. Here an MWE:

struct UndirectedEdge
    f1
    f2
end

import Base.==
==(x::UndirectedEdge, y::UndirectedEdge) = (x.f1 == y.f1 && x.f2 == y.f2) || (x.f2 == y.f1 && x.f1 == y.f2)
e1 = UndirectedEdge(:a, :b)
e2 = UndirectedEdge(:b, :a) # equal to e1
e3 = UndirectedEdge(:c, :a)

unique([e1, e2, e3])

e1 == e2
isequal(e1, e2) # isequal should be enough for unique

It works, when I additionally define a hash method:

import Base.hash
hash(x::UndirectedEdge, h::UInt) = hash(sort([x.f1, x.f2]), h)

unique([e1, e2, e3])

It works for me if I define Base.isequal instead of Base.==.

EDIT: It even works when I define Base.== as you did (no Base.hash method necessary).

This is documented in the docs for isequal:

isequal is the comparison function used by hash tables (Dict). isequal(x,y) must imply that hash(x)
== hash(y)

, as well as in the docs of hash. But yes, this invariant is a little to easy to forget, for which there is an old issue: custom hashing is too easy to accidentally break · Issue #12198 · JuliaLang/julia · GitHub

2 Likes

Performance aside, can I define isequal in terms of hashing?

import Base.==
import Base.hash

hash(x::UndirectedEdge, h::UInt) = hash(sort([x.f1, x.f2]), h)
==(x::UndirectedEdge, y::UndirectedEdge) = hash(x) == hash(y)

I think so! But then you might get nasty surprises if you find a hash collision.

Thank you so much. It was not clear to me that I have to implement hash and isequal in tandem even if I do not care too much about hashing.

If you don’t implement them in tandem, things like putting your custom struct into a Dict may break, as it uses a hash to determine which bucket to place the object. It’s not necessarily related to whether you care about hashing itself, but what you want to do with it.

1 Like