Unique doesn't seem to obey == definitions

The behaviour for unique(itr) doesn’t seem to follow the documentation which specifies items are compared using an isequal check.

This got me when I have a custom struct, e.g.

mutable struct A
x
end
Base.var"=="(one::A, two::A) = (one.x == two.x)

and then wanted to do something like:

temp = [A(1), A(2), A(1)]
temp[1] == temp[3] # returns true
unique(temp) # returns three elements not [A(1), A(2)] as expected

Is there something trivial I am forgetting here?

Digging deeper into this, it seems that all variants of unique eventually call hash to figure out whether two objects are the same. In the case of a generic iterate, this happens in the check !in(x, seen) where seen is a Set which uses hashes. In the case of an AbstractArray hashes are used directly.

So I presume this means I am required to implement hash as well as == to use unique? Should the documentation be updated to reflect this?

I should add, that in my current case implementing a hash is difficult whereas == is trivial, because I am using CxxWrap.jl objects.

This typically means that types for which a custom == or isequal method exists must implement a corresponding hash method (and vice versa). Collections typically implement isequal by calling isequal recursively on all contents.

There is already a cross reference though adding emphasis might be OK.

2 Likes

Oh I see - I didn’t dig down into the isequal documentation once I had read that unique uses isequal and verified that isequal was behaving correctly at the REPL.

That makes sense now. However, I think the documentation of unique is then just plain wrong. The implementation of unique never calls isequal, it only uses hash. Maybe the unique documentation could be changed to read:

Return an array containing only the unique elements of collection itr, as determined by hash, in the order that …

Does that sound right?

Edit: I realise I forgot to ask, should I submit a PR for this?

1 Like

No the document is not wrong and unique does call isequal.

In the implementation in https://github.com/JuliaLang/julia/blob/7f7ab69fd30e7ea57ffc1c6a8c803826bf02f838/base/set.jl#L122 which is for the generic iterator, unique does not call isequal. It instead constructs a Set which uses hash.

The only call to isequal that I can see relevant to isunique is in _groupedunique! which is only called from the unique! entrypoint and only if the first argument is a Vector of reals and is already sorted.

The hash function is required to produce equal hashes for isequal values. Moreover Set (which is implemented using Dict) may check isequal (since explicit equality checks are ultimately necessary for hash collision resolution).

1 Like

Oh I think I see the issue now. Thinking from the point of view of unique and pretending I don’t know it uses a Set/Dict internally, I can think of it as:

  1. Check if hashes differ - if so then items are different.
  2. Otherwise, check isequal between the items.

So hash acts as a shortcircuit.

Thanks @yuyichao and @stevengj for the explanations!

1 Like

From a comment by @yuyichao in the GH PR, it finally clicked for me that unique is a function that acts on sets, which makes perfect sense for where it is defined in set.jl.

But this makes me realise that I need a function unique which never tries to hash the items in the list. I pretty much need a function that relies only on transitivity of ==. This might be best explained as a “group by” operation followed by selecting one element from each group. Here’s a crappy implementation:

function group_by(itr ; by=Base.:(==))

    groups = Vector{Any}[]

    for item in itr
        ind = findfirst(x -> by(x, first(g)), groups)
        if ind === nothing
            push!(groups, Any[item])
        else
            push!(groups[ind], item)
        end
    end

    groups
end

function unique_by(itr ; by=Base.:(==))
    groups = group_by(itr ; by)
    return first.(groups)
end

This kind of unique does not have any guarantees over which item is returned, as all of the items in each group may contain hidden state (e.g. I could have just as easily used last instead of first). But every item in each group must satisfy x == y.

To bring this back to my use case, I have objects that have been wrapped by CxxWrap.jl and so are only pointers. I can invoke a C++ function a == b but I don’t have any sensible way to give these objects a hash.

The unique_by above could be made more efficient (it doesn’t need to form the groups first) but I think this is illustrating the difference between a Dict and a set of lists.