x = rand(10_000_000);
@time unique(x);
# 1.6 seconds
@time unique(Ref(x))[1];
# 0.006373 seconds
unique(x) == unique(Ref(x))[1]
# true

Why would anyone use unique on it owns instead of unique + Ref ?
Canβt Julia optimise this automatically?

Thank you

1 Like

They are not the same:

x = [2, 3, 2, 3]
unique(x)                # [2, 3]
unique(Ref(x))[1]        # [2, 3, 2, 3]
unique(Ref(x))[1] === x  # true

Note that unique(Ref(x))[1] just returns x, since Ref(x) is equivalent here to a vector with a single element x.

2 Likes
julia> x = rand(1:10, 10);

julia> x
10-element Vector{Int64}:
3
6
1
8
7
7
10
6
1
3

julia> unique(x)
6-element Vector{Int64}:
3
6
1
8
7
10

julia> unique(Ref(x))
1-element Vector{Vector{Int64}}:
[3, 6, 1, 8, 7, 7, 10, 6, 1, 3]

Note that unique(Ref(x)) doesnβt do what you think it does. Ref(x) effectively makes x βlook like a scalarβ for unique which is why it just returns x itself - the only unique element - in a vector. And, of course, thatβs much cheaper / faster than actually finding the unique values.

2 Likes

Thank you.
Unique in Julia seems to be very slow compare to unique in R for the example I provided β¦ not sure to understand whyβ¦

x = runif(1e7)
system.time(unique(x))
# 0.85 sec
1 Like

If you donβt care about the order of the returned elements, you can try something like the following, which seems to reduce the timings by a factor ~2:

x = rand(10_000_000);

# Assumes `xs` is already sorted.
function unique_sorted(xs)
xprev = first(xs)
ys = [xprev]
for x β xs
if x != xprev
push!(ys, x)
end
xprev = x
end
ys
end

function unique_sort(xs)
ys = sort(xs)
unique_sorted(ys)
end

julia> using BenchmarkTools

julia> @benchmark unique(\$x)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min β¦ max):  1.813 s β¦    2.103 s  β GC (min β¦ max): 4.31% β¦ 5.35%
Time  (median):     2.045 s               β GC (median):    3.82%
Time  (mean Β± Ο):   1.987 s Β± 153.230 ms  β GC (mean Β± Ο):  3.32% Β± 2.64%

β                                             β          β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
1.81 s         Histogram: frequency by time          2.1 s <

Memory estimate: 323.67 MiB, allocs estimate: 80.

julia> @benchmark unique_sort(\$x)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
Range (min β¦ max):  878.788 ms β¦ 986.507 ms  β GC (min β¦ max): 0.43% β¦ 8.84%
Time  (median):     893.003 ms               β GC (median):    0.04%
Time  (mean Β± Ο):   905.269 ms Β±  40.648 ms  β GC (mean Β± Ο):  1.69% Β± 3.57%

ββ    β   ββ                                                β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
879 ms           Histogram: frequency by time          987 ms <

Memory estimate: 223.13 MiB, allocs estimate: 20.
1 Like

Thank you @jipolanco , this is helpful and improve performance but it is still slower than R. I think I can write something in C which is faster than the R version. I will also try to write it in Julia to see how it compares. Thanks again.

Please share it! It would be nice to see what algorithm R is using.

1 Like

Instead of unique(x), another way of writing unique sorted:

function uniquesorted(x)
y = sort(x)
y[diff([y; Inf]) .!= 0]
end

x = rand(10_000_000);
using BenchmarkTools
@btime unique(\$x)       # 1.254 s (80 allocations: 360 MiB)
@btime uniquesorted(\$x) #  932 ms (12 allocations: 306 MiB)

As a side note: with 64-bit floats and x = rand(10_000_000), the probabability that unique(x) != x should be extremelly low, no?

1 Like