Unique! and sorted collections

anon94023334 · October 15, 2018, 4:31pm

Consider

a = rand(1:10_000_000, 100_000_000)
b = sort(a)
c = copy(a)

unique!(c) takes 6.79s.
unique!(b) takes 0.35s.

However, it seems to me that unique! does an issorted first. If we call _groupedunique! first, we save the O(n) (worst case, which is what happens when you’re checking a sorted list) check of issorted() and get 0.28s.

Is there a reason that there isn’t an exported uniquesorted!() (or equivalent) function that bypasses the issorted() check when we know that the vector is already sorted?

adamslc · October 15, 2018, 5:52pm

I can’t imagine why adding the keyword argument sorted to unique! in 1.1 would be objectionable…

xiaodai · October 16, 2018, 12:48am

Also it would be good to allow metadata to stored with the vector. Eg. Sorted

adamslc · October 16, 2018, 5:29am

The issue with that approach is that the metadata would probably be encoded as a type. For example, a SortedVector{Int64}, which would allow unique! and friends to dispatch on this extra information and reap the performance benefits.
However, if some function overly restricted its input type, for instance Vector{T} where T <: Number, then the SortedVector wouldn’t work at all!
For a real life example of this, look at the Linear Algebra code in Base. There has been a huge proliferation of new types to describe different properties of matrices. When this information is used to dispatch to optimized methods, there is a huge performance payoff, but if a method is missing, then the calculations resort to slow fallback methods. Adding all of the appropriate methods continues to be a big challenge.

xiaodai · October 16, 2018, 5:50am

This is where traits can help.

laborg · October 16, 2018, 6:12am

It would only be not objectionable if sorted would be a keyword hint to skip the issorted check, but not to let it choose if the sorting check is done at all, because this would introduce regressions. I’m undecided if the use case is there in real life (e.g. the unique versions that take a function as argument have bugs and nobody has noticed them yet, where i feel, that it’s just not used that often).

In my PR: RFC: Implement unique consistently by laborg · Pull Request #29038 · JuliaLang/julia · GitHub where I tried to improve the performance of all unique[!]([f]) functions I’ve already added a issorted argument to the internal implementation, that could easily be made a key word argument for the external function.

Topic		Replies	Views
Add unique!(f, itr) #28415 Internals & Design	6	649	December 5, 2018
Storing sortedness for numeric vectors Internals & Design vector	14	442	January 4, 2025
Fastest way to `unique!` General Usage	17	3202	October 11, 2020
Unique doesn't seem to obey == definitions General Usage	7	580	October 10, 2020
Question about Ref and about unique New to Julia question	8	447	October 22, 2021

Unique! and sorted collections

Related topics