Question about Ref and about unique

Please see the below code:

x = rand(10_000_000);
@time unique(x);
# 1.6 seconds
@time unique(Ref(x))[1];
# 0.006373 seconds
unique(x) == unique(Ref(x))[1]
# true

Why would anyone use unique on it owns instead of unique + Ref ?
Can’t Julia optimise this automatically?

Thank you

1 Like

They are not the same:

x = [2, 3, 2, 3]
unique(x)                # [2, 3]
unique(Ref(x))[1]        # [2, 3, 2, 3]
unique(Ref(x))[1] === x  # true

Note that unique(Ref(x))[1] just returns x, since Ref(x) is equivalent here to a vector with a single element x.

2 Likes
julia> x = rand(1:10, 10);

julia> x
10-element Vector{Int64}:
  3
  6
  1
  8
  7
  7
 10
  6
  1
  3

julia> unique(x)
6-element Vector{Int64}:
  3
  6
  1
  8
  7
 10

julia> unique(Ref(x))
1-element Vector{Vector{Int64}}:
 [3, 6, 1, 8, 7, 7, 10, 6, 1, 3]

Note that unique(Ref(x)) doesn’t do what you think it does. Ref(x) effectively makes x β€œlook like a scalar” for unique which is why it just returns x itself - the only unique element - in a vector. And, of course, that’s much cheaper / faster than actually finding the unique values.

2 Likes

Thank you.
Unique in Julia seems to be very slow compare to unique in R for the example I provided … not sure to understand why…

x = runif(1e7)
system.time(unique(x))
# 0.85 sec
1 Like

If you don’t care about the order of the returned elements, you can try something like the following, which seems to reduce the timings by a factor ~2:

x = rand(10_000_000);

# Assumes `xs` is already sorted.
function unique_sorted(xs)
    xprev = first(xs)
    ys = [xprev]
    for x ∈ xs
        if x != xprev
            push!(ys, x)
        end
        xprev = x
    end
    ys
end

function unique_sort(xs)
    ys = sort(xs)
    unique_sorted(ys)
end


julia> using BenchmarkTools

julia> @benchmark unique($x)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.813 s …    2.103 s  β”Š GC (min … max): 4.31% … 5.35%
 Time  (median):     2.045 s               β”Š GC (median):    3.82%
 Time  (mean Β± Οƒ):   1.987 s Β± 153.230 ms  β”Š GC (mean Β± Οƒ):  3.32% Β± 2.64%

  β–ˆ                                             β–ˆ          β–ˆ  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  1.81 s         Histogram: frequency by time          2.1 s <

 Memory estimate: 323.67 MiB, allocs estimate: 80.

julia> @benchmark unique_sort($x)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  878.788 ms … 986.507 ms  β”Š GC (min … max): 0.43% … 8.84%
 Time  (median):     893.003 ms               β”Š GC (median):    0.04%
 Time  (mean Β± Οƒ):   905.269 ms Β±  40.648 ms  β”Š GC (mean Β± Οƒ):  1.69% Β± 3.57%

  β–ˆβ–ˆ    β–ˆ   β–ˆβ–ˆ                                                β–ˆ  
  β–ˆβ–ˆβ–β–β–β–β–ˆβ–β–β–β–ˆβ–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  879 ms           Histogram: frequency by time          987 ms <

 Memory estimate: 223.13 MiB, allocs estimate: 20.
1 Like

Thank you @jipolanco , this is helpful and improve performance but it is still slower than R. I think I can write something in C which is faster than the R version. I will also try to write it in Julia to see how it compares. Thanks again.

Please share it! It would be nice to see what algorithm R is using.

1 Like

Instead of unique(x), another way of writing unique sorted:

function uniquesorted(x)
   y = sort(x)
   y[diff([y; Inf]) .!= 0]
end

x = rand(10_000_000);
using BenchmarkTools
@btime unique($x)       # 1.254 s (80 allocations: 360 MiB)
@btime uniquesorted($x) #  932 ms (12 allocations: 306 MiB)

As a side note: with 64-bit floats and x = rand(10_000_000), the probabability that unique(x) != x should be extremelly low, no?

1 Like