MATLAB-like `unique` where you can recreate the original with indices

Is there an implementation of the MATLAB 3-output unique in Julia? In particular I’m looking for an efficient way to recreate the original input from the unique elements, like C(ic) in MATLAB.

Previous discussions have focused on the indices of the unique elements (the ia output), but I haven’t found an implementation featuring the ic output.

2 Likes

How about this:

function uniq(x)
    d = Dict{eltype(x), Int}()
    ia = Int[]
    ic = Vector{Int}(undef, length(x))
    k = Ref(0)
    for (i, xx) in enumerate(x)
        ic[i] = get!(d, xx) do
            push!(ia, i)
            k[] += 1
        end
    end

    u = Vector{eltype(x)}(undef, length(d))
    for (key, val) in d
        u[val] = key
    end

    u, ia, ic
end
julia> x = [1, 3, 3, 2, 1, 2, 1, 2, 2, 3]
10-element Vector{Int64}:
 1
 3
 3
 2
 1
 2
 1
 2
 2
 3

julia> uniq(x)
([1, 3, 2], [1, 2, 4], [1, 2, 2, 3, 1, 3, 1, 3, 3, 2])
1 Like

If you only want unique values and indices ic, you can also use PooledArrays, eg

using PooledArrays: PooledArray
using DataAPI: refpool, refarray

v = rand(["a", "b", "c"], 100)
p = PooledArray(v)
@assert v == refpool(p)[refarray(p)]
4 Likes

@mcabbott mentioned GroupSlices.jl, which has all the functions necessary to generate the vectors, but is several times slower than `unique.

I managed to get an implementation that only works for vectors that replicates what Octave does, and it’s pretty close to the performance of unique while providing the two extra vectors. Unfortunately, it’s not the “stable” unique that Julia currently has (that preserves order of first appearance), it’s a “sorted” unique.

julia> @benchmark unique(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.909 μs …  12.138 ms  ┊ GC (min … max):  0.00% … 98.59%
 Time  (median):      86.341 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   104.336 μs ± 301.032 μs  ┊ GC (mean ± σ):  10.56% ±  3.68%

  ▅▆▆█▇▄▂▁▁     ▁▂▃▃▃▂▂▂▂▁▁▁▁                                   ▂
  █████████████████████████████▇▇▇▆▇█▇▇▇▇▇▇▇██▇▇▆▅▆▆▅▅▅▅▆▅▆▅▅▅▅ █
  80.9 μs       Histogram: log(frequency) by time        166 μs <

 Memory estimate: 65.95 KiB, allocs estimate: 27.

julia> @benchmark matlab_unique(A)  # 3-output with GroupSlices
BenchmarkTools.Trial: 2633 samples with 1 evaluation.
 Range (min … max):  1.364 ms … 21.726 ms  ┊ GC (min … max):  0.00% … 92.33%
 Time  (median):     1.578 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.890 ms ±  1.643 ms  ┊ GC (mean ± σ):  12.27% ± 13.07%

   █
  ▇█▅▃▂▂▂▂▂▂▂▁▁▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▂▁▂▂▂▁▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▂
  1.36 ms        Histogram: frequency by time        10.8 ms <

 Memory estimate: 1.15 MiB, allocs estimate: 3840.

julia> @benchmark matlab_unique2(A)  # 3-output using sortperm, comparison, and index math, like Octave or old MATLAB
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   85.626 μs … 29.064 ms  ┊ GC (min … max):  0.00% … 99.43%
 Time  (median):     120.276 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   259.169 μs ±  1.356 ms  ┊ GC (mean ± σ):  50.93% ± 10.34%

  █▂                                                           ▁
  ██▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄▁▃▁▃▄▄ █
  85.6 μs       Histogram: log(frequency) by time      7.23 ms <

 Memory estimate: 498.28 KiB, allocs estimate: 21.

julia> @benchmark matlab_unique3(A)  # PooledArray, only returns 1st and 3rd outputs
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   89.099 μs …  30.173 ms  ┊ GC (min … max):  0.00% … 99.31%
 Time  (median):     101.688 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   152.401 μs ± 792.605 μs  ┊ GC (mean ± σ):  27.92% ±  6.43%

  ▄▆▆▇▇█▇▄▃▂▂▂▂▂▃▃▄▃▃▂      ▁▁▁▁                                ▂
  ████████████████████████████████▇▅▇▆▆▇▇█▇█▇▆▆▆▃▅▃▆▆▆▇▆▇▆▄▅▄▃▅ █
  89.1 μs       Histogram: log(frequency) by time        229 μs <

 Memory estimate: 204.91 KiB, allocs estimate: 36.

The PooledArray approach is promising as it matches the output of unique; it would be nice to have a good way to get ia from that. Using indexin would nearly double the runtime, placing it above matlab_unique2.

1 Like

StructArrays also has some internal machinery that does the sorted approach. For the non-sorted I don’t know much other than PooledArrays, but it seems cheap enough to get ia from ic?

Something like

function find_ia(pool, ic)
    ia = fill(zero(eltype(ic)), length(pool))
    for (i, idx) in enumerate(ic)
        (ia[idx] == 0) && (ia[idx] = i)
    end
    return ia
end

Sorry, meant to reply earlier, but you’re absolutely right. Thank you!

I do need to remember not to @benchmark blocks of code; I almost tried to completely rewrite that to perform the same operation outside of an enumerate loop, but I just needed to do proper function wrapping. For some reason my instincts wrongly told me that wrapping in a scope-creating block had the same effect as wrapping in a function.

That combined with the rest of matlab_unique3 has proven the best. Now to hopefully generalize a bit more and maybe publish something nice to prove that it’s useful.

2 Likes

Yeah, here it’s a bit tricky because somehow the PooledArray constructor is type-unstable

julia> using PooledArrays, Test

julia> @inferred PooledArray(rand(1:3, 100))
ERROR: return type PooledVector{Int64, UInt32, Vector{UInt32}} does not match inferred return type PooledArray{Int64}

so you do need the find_ia function barrier (that had also tricked me initially into believing that find_ia was much more expensive).

Luckily with PooledArray you can just specify your reftype in the constructor and it’s type-stable again, which is trivial for a lot of types, and can rely on UInt64 or even Int as a reasonable fallback.