Note that unique(Ref(x)) doesn't do what you think it does. Ref(x) effectively makes x "look like a scalar" to unique, which is why it just returns x itself - the only unique element - in a vector. And, of course, that's much cheaper / faster than actually finding the unique values.
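A minimal sketch of the difference, using a small made-up vector (the values and the variable names `u_ref` and `u` are only for illustration):

```julia
x = [3.0, 1.0, 3.0, 2.0, 1.0]

# Ref(x) wraps x as a single item, so unique sees exactly one "element": x itself.
u_ref = unique(Ref(x))

# Without Ref, unique iterates over the numbers in x and keeps first occurrences.
u = unique(x)

println(length(u_ref))    # number of items found when x is wrapped in Ref
println(only(u_ref) == x) # the single item is the whole vector
println(u)
```

So a benchmark of unique(Ref(x)) times the wrapping of one object, not the deduplication of its contents.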
If you don't care about the order of the returned elements, you can try something like the following, which seems to reduce the timings by a factor of about 2:
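x = rand(10_000_000);

# Assumes `xs` is already sorted.
function unique_sorted(xs)
    isempty(xs) && return similar(xs, 0)  # `first` would throw on empty input
    xprev = first(xs)
    ys = [xprev]
    for x ∈ xs
        # Push only when the value differs from the previous element
        if x != xprev
            push!(ys, x)
        end
        xprev = x
    end
    ys
end

# Sorts a copy of `xs`, then drops adjacent duplicates.
function unique_sort(xs)
    ys = sort(xs)
    unique_sorted(ys)
end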
julia> using BenchmarkTools
julia> @benchmark unique($x)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.813 s … 2.103 s  ┊ GC (min … max): 4.31% … 5.35%
 Time  (median):     2.045 s            ┊ GC (median):    3.82%
 Time  (mean ± σ):   1.987 s ± 153.230 ms  ┊ GC (mean ± σ):  3.32% ± 2.64%
 1.81 s         Histogram: frequency by time         2.1 s <
Memory estimate: 323.67 MiB, allocs estimate: 80.
julia> @benchmark unique_sort($x)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  878.788 ms … 986.507 ms  ┊ GC (min … max): 0.43% … 8.84%
 Time  (median):     893.003 ms               ┊ GC (median):    0.04%
 Time  (mean ± σ):   905.269 ms ± 40.648 ms   ┊ GC (mean ± σ):  1.69% ± 3.57%
 879 ms         Histogram: frequency by time         987 ms <
Memory estimate: 223.13 MiB, allocs estimate: 20.
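As a possible alternative to the hand-written loop, Base's unique! can be applied to a sorted copy. This is a different technique from the one benchmarked above, the helper name `unique_sort_builtin` is hypothetical, and no timings were measured for it here, so it would need to be benchmarked on your own data:

```julia
# Hypothetical helper (not from the posts above): sort a copy, then
# deduplicate it in place with Base's unique!.
unique_sort_builtin(xs) = unique!(sort(xs))

xs = [0.3, 0.1, 0.3, 0.2, 0.1]
ys = unique_sort_builtin(xs)

println(ys)  # sorted unique values
println(xs)  # the input is untouched because sort returns a copy
```

If the input may be overwritten, sort!(xs) followed by unique!(xs) avoids the extra copy.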
Thank you @jipolanco, this is helpful and improves performance, but it is still slower than R. I think I can write something in C that is faster than the R version. I will also try writing it in Julia to see how it compares. Thanks again.
Instead of unique(x), here is another way to get the sorted unique values:
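function uniquesorted(x)
    y = sort(x)
    # Appending Inf makes the last difference nonzero, so the final element is kept;
    # a nonzero difference marks the last element of each run of equal values.
    y[diff([y; Inf]) .!= 0]
end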
x = rand(10_000_000);
using BenchmarkTools
@btime unique($x) # 1.254 s (80 allocations: 360 MiB)
@btime uniquesorted($x) # 932 ms (12 allocations: 306 MiB)
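To see how the diff mask in uniquesorted picks its elements, here are the intermediate steps on a small sorted vector (the values are made up for illustration):

```julia
y = [1.0, 1.0, 2.0, 3.0, 3.0]   # already sorted

d = diff([y; Inf])   # differences between neighbours: [0.0, 1.0, 1.0, 0.0, Inf]
mask = d .!= 0       # true at the last element of each run: [0, 1, 1, 0, 1]
u = y[mask]          # [1.0, 2.0, 3.0]

println(u)
```

Note that the concatenation [y; Inf] and the diff each allocate a temporary array the size of the input, which is consistent with the higher memory figure reported for this version compared to the loop-based one.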