Unique! and sorted collections

Consider

a = rand(1:10_000_000, 100_000_000)
b = sort(a)
c = copy(a)

unique!(c) takes 6.79s.
unique!(b) takes 0.35s.

However, it seems to me that unique! does an issorted first. If we call _groupedunique! first, we save the O(n) (worst case, which is what happens when you’re checking a sorted list) check of issorted() and get 0.28s.

Is there a reason that there isn’t an exported uniquesorted!() (or equivalent) function that bypasses the issorted() check when we know that the vector is already sorted?

I can’t imagine why adding the keyword argument sorted to unique! in 1.1 would be objectionable…

4 Likes

Also it would be good to allow metadata to stored with the vector. Eg. Sorted

The issue with that approach is that the metadata would probably be encoded as a type. For example, a SortedVector{Int64}, which would allow unique! and friends to dispatch on this extra information and reap the performance benefits.
However, if some function overly restricted its input type, for instance Vector{T} where T <: Number, then the SortedVector wouldn’t work at all!
For a real life example of this, look at the Linear Algebra code in Base. There has been a huge proliferation of new types to describe different properties of matrices. When this information is used to dispatch to optimized methods, there is a huge performance payoff, but if a method is missing, then the calculations resort to slow fallback methods. Adding all of the appropriate methods continues to be a big challenge.

This is where traits can help.

It would only be not objectionable if sorted would be a keyword hint to skip the issorted check, but not to let it choose if the sorting check is done at all, because this would introduce regressions. I’m undecided if the use case is there in real life (e.g. the unique versions that take a function as argument have bugs and nobody has noticed them yet, where i feel, that it’s just not used that often).

In my PR: https://github.com/JuliaLang/julia/pull/29038 where I tried to improve the performance of all unique[!]([f]) functions I’ve already added a issorted argument to the internal implementation, that could easily be made a key word argument for the external function.