# MATLAB-like `unique` where you can recreate the original with indices

Is there an implementation of the MATLAB 3-output `unique` in Julia? In particular I’m looking for an efficient way to recreate the original input from the unique elements, like `C(ic)` in MATLAB.

Previous discussions have focused on the indices of the unique elements (the `ia` output), but I haven’t found an implementation featuring the `ic` output.

2 Likes

``````function uniq(x)
d = Dict{eltype(x), Int}()
ia = Int[]
ic = Vector{Int}(undef, length(x))
k = Ref(0)
for (i, xx) in enumerate(x)
ic[i] = get!(d, xx) do
push!(ia, i)
k[] += 1
end
end

u = Vector{eltype(x)}(undef, length(d))
for (key, val) in d
u[val] = key
end

u, ia, ic
end
``````
``````julia> x = [1, 3, 3, 2, 1, 2, 1, 2, 2, 3]
10-element Vector{Int64}:
1
3
3
2
1
2
1
2
2
3

julia> uniq(x)
([1, 3, 2], [1, 2, 4], [1, 2, 2, 3, 1, 3, 1, 3, 3, 2])
``````
1 Like

If you only want unique values and indices `ic`, you can also use PooledArrays, eg

``````using PooledArrays: PooledArray
using DataAPI: refpool, refarray

v = rand(["a", "b", "c"], 100)
p = PooledArray(v)
@assert v == refpool(p)[refarray(p)]
``````
4 Likes

@mcabbott mentioned GroupSlices.jl, which has all the functions necessary to generate the vectors, but is several times slower than `unique.

I managed to get an implementation that only works for vectors that replicates what Octave does, and it’s pretty close to the performance of `unique` while providing the two extra vectors. Unfortunately, it’s not the “stable” unique that Julia currently has (that preserves order of first appearance), it’s a “sorted” unique.

``````julia> @benchmark unique(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   80.909 μs …  12.138 ms  ┊ GC (min … max):  0.00% … 98.59%
Time  (median):      86.341 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   104.336 μs ± 301.032 μs  ┊ GC (mean ± σ):  10.56% ±  3.68%

▅▆▆█▇▄▂▁▁     ▁▂▃▃▃▂▂▂▂▁▁▁▁                                   ▂
█████████████████████████████▇▇▇▆▇█▇▇▇▇▇▇▇██▇▇▆▅▆▆▅▅▅▅▆▅▆▅▅▅▅ █
80.9 μs       Histogram: log(frequency) by time        166 μs <

Memory estimate: 65.95 KiB, allocs estimate: 27.

julia> @benchmark matlab_unique(A)  # 3-output with GroupSlices
BenchmarkTools.Trial: 2633 samples with 1 evaluation.
Range (min … max):  1.364 ms … 21.726 ms  ┊ GC (min … max):  0.00% … 92.33%
Time  (median):     1.578 ms              ┊ GC (median):     0.00%
Time  (mean ± σ):   1.890 ms ±  1.643 ms  ┊ GC (mean ± σ):  12.27% ± 13.07%

█
▇█▅▃▂▂▂▂▂▂▂▁▁▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▂▁▂▂▂▁▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▂
1.36 ms        Histogram: frequency by time        10.8 ms <

Memory estimate: 1.15 MiB, allocs estimate: 3840.

julia> @benchmark matlab_unique2(A)  # 3-output using sortperm, comparison, and index math, like Octave or old MATLAB
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   85.626 μs … 29.064 ms  ┊ GC (min … max):  0.00% … 99.43%
Time  (median):     120.276 μs              ┊ GC (median):     0.00%
Time  (mean ± σ):   259.169 μs ±  1.356 ms  ┊ GC (mean ± σ):  50.93% ± 10.34%

█▂                                                           ▁
██▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄▁▃▁▃▄▄ █
85.6 μs       Histogram: log(frequency) by time      7.23 ms <

Memory estimate: 498.28 KiB, allocs estimate: 21.

julia> @benchmark matlab_unique3(A)  # PooledArray, only returns 1st and 3rd outputs
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   89.099 μs …  30.173 ms  ┊ GC (min … max):  0.00% … 99.31%
Time  (median):     101.688 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   152.401 μs ± 792.605 μs  ┊ GC (mean ± σ):  27.92% ±  6.43%

▄▆▆▇▇█▇▄▃▂▂▂▂▂▃▃▄▃▃▂      ▁▁▁▁                                ▂
████████████████████████████████▇▅▇▆▆▇▇█▇█▇▆▆▆▃▅▃▆▆▆▇▆▇▆▄▅▄▃▅ █
89.1 μs       Histogram: log(frequency) by time        229 μs <

Memory estimate: 204.91 KiB, allocs estimate: 36.
``````

The PooledArray approach is promising as it matches the output of `unique`; it would be nice to have a good way to get `ia` from that. Using `indexin` would nearly double the runtime, placing it above `matlab_unique2`.

1 Like

StructArrays also has some internal machinery that does the sorted approach. For the non-sorted I don’t know much other than PooledArrays, but it seems cheap enough to get `ia` from `ic`?

Something like

``````function find_ia(pool, ic)
ia = fill(zero(eltype(ic)), length(pool))
for (i, idx) in enumerate(ic)
(ia[idx] == 0) && (ia[idx] = i)
end
return ia
end
``````

Sorry, meant to reply earlier, but you’re absolutely right. Thank you!

I do need to remember not to `@benchmark` blocks of code; I almost tried to completely rewrite that to perform the same operation outside of an enumerate loop, but I just needed to do proper function wrapping. For some reason my instincts wrongly told me that wrapping in a scope-creating block had the same effect as wrapping in a function.

That combined with the rest of `matlab_unique3` has proven the best. Now to hopefully generalize a bit more and maybe publish something nice to prove that it’s useful.

2 Likes

Yeah, here it’s a bit tricky because somehow the `PooledArray` constructor is type-unstable

``````julia> using PooledArrays, Test

julia> @inferred PooledArray(rand(1:3, 100))
ERROR: return type PooledVector{Int64, UInt32, Vector{UInt32}} does not match inferred return type PooledArray{Int64}
``````

so you do need the `find_ia` function barrier (that had also tricked me initially into believing that `find_ia` was much more expensive).

Luckily with PooledArray you can just specify your reftype in the constructor and it’s type-stable again, which is trivial for a lot of types, and can rely on `UInt64` or even `Int` as a reasonable fallback.