The behaviour for unique(itr) doesn’t seem to follow the documentation which specifies items are compared using an isequal check.
This got me when I have a custom struct, e.g.
mutable struct A
x
end
Base.var"=="(one::A, two::A) = (one.x == two.x)
and then wanted to do something like:
temp = [A(1), A(2), A(1)]
temp[1] == temp[3] # returns true
unique(temp) # returns three elements not [A(1), A(2)] as expected
Is there something trivial I am forgetting here?
Digging deeper into this, it seems that all variants of unique eventually call hash to figure out whether two objects are the same. In the case of a generic iterate, this happens in the check !in(x, seen) where seen is a Set which uses hashes. In the case of an AbstractArray hashes are used directly.
So I presume this means I am required to implement hash as well as == to use unique? Should the documentation be updated to reflect this?
I should add, that in my current case implementing a hash is difficult whereas == is trivial, because I am using CxxWrap.jl objects.
Oh I see - I didn’t dig down into the isequal documentation once I had read that unique uses isequal and verified that isequal was behaving correctly at the REPL.
That makes sense now. However, I think the documentation of unique is then just plain wrong. The implementation of unique never calls isequal, it only uses hash. Maybe the unique documentation could be changed to read:
Return an array containing only the unique elements of collection itr, as determined by hash, in the order that …
Does that sound right?
Edit: I realise I forgot to ask, should I submit a PR for this?
The only call to isequal that I can see relevant to isunique is in _groupedunique! which is only called from the unique! entrypoint and only if the first argument is a Vector of reals and is already sorted.
Oh I think I see the issue now. Thinking from the point of view of unique and pretending I don’t know it uses a Set/Dict internally, I can think of it as:
Check if hashes differ - if so then items are different.
From a comment by @yuyichao in the GH PR, it finally clicked for me that unique is a function that acts on sets, which makes perfect sense for where it is defined in set.jl.
But this makes me realise that I need a function unique which never tries to hash the items in the list. I pretty much need a function that relies only on transitivity of ==. This might be best explained as a “group by” operation followed by selecting one element from each group. Here’s a crappy implementation:
function group_by(itr ; by=Base.:(==))
groups = Vector{Any}[]
for item in itr
ind = findfirst(x -> by(x, first(g)), groups)
if ind === nothing
push!(groups, Any[item])
else
push!(groups[ind], item)
end
end
groups
end
function unique_by(itr ; by=Base.:(==))
groups = group_by(itr ; by)
return first.(groups)
end
This kind of unique does not have any guarantees over which item is returned, as all of the items in each group may contain hidden state (e.g. I could have just as easily used last instead of first). But every item in each group must satisfy x == y.
To bring this back to my use case, I have objects that have been wrapped by CxxWrap.jl and so are only pointers. I can invoke a C++ function a == b but I don’t have any sensible way to give these objects a hash.
The unique_by above could be made more efficient (it doesn’t need to form the groups first) but I think this is illustrating the difference between a Dict and a set of lists.