MATLAB-like `unique` where you can recreate the original with indices

iamed2 · March 16, 2022, 1:05pm

Is there an implementation of the MATLAB 3-output unique in Julia? In particular I’m looking for an efficient way to recreate the original input from the unique elements, like C(ic) in MATLAB.

Previous discussions have focused on the indices of the unique elements (the ia output), but I haven’t found an implementation featuring the ic output.

jules · March 16, 2022, 1:17pm

How about this:

function uniq(x)
    d = Dict{eltype(x), Int}()
    ia = Int[]
    ic = Vector{Int}(undef, length(x))
    k = Ref(0)
    for (i, xx) in enumerate(x)
        ic[i] = get!(d, xx) do
            push!(ia, i)
            k[] += 1
        end
    end

    u = Vector{eltype(x)}(undef, length(d))
    for (key, val) in d
        u[val] = key
    end

    u, ia, ic
end

julia> x = [1, 3, 3, 2, 1, 2, 1, 2, 2, 3]
10-element Vector{Int64}:
 1
 3
 3
 2
 1
 2
 1
 2
 2
 3

julia> uniq(x)
([1, 3, 2], [1, 2, 4], [1, 2, 2, 3, 1, 3, 1, 3, 3, 2])

piever · March 16, 2022, 2:03pm

If you only want unique values and indices ic, you can also use PooledArrays, eg

using PooledArrays: PooledArray
using DataAPI: refpool, refarray

v = rand(["a", "b", "c"], 100)
p = PooledArray(v)
@assert v == refpool(p)[refarray(p)]

iamed2 · March 16, 2022, 5:37pm

@mcabbott mentioned GroupSlices.jl, which has all the functions necessary to generate the vectors, but is several times slower than `unique.

I managed to get an implementation that only works for vectors that replicates what Octave does, and it’s pretty close to the performance of unique while providing the two extra vectors. Unfortunately, it’s not the “stable” unique that Julia currently has (that preserves order of first appearance), it’s a “sorted” unique.

julia> @benchmark unique(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.909 μs …  12.138 ms  ┊ GC (min … max):  0.00% … 98.59%
 Time  (median):      86.341 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   104.336 μs ± 301.032 μs  ┊ GC (mean ± σ):  10.56% ±  3.68%

  ▅▆▆█▇▄▂▁▁     ▁▂▃▃▃▂▂▂▂▁▁▁▁                                   ▂
  █████████████████████████████▇▇▇▆▇█▇▇▇▇▇▇▇██▇▇▆▅▆▆▅▅▅▅▆▅▆▅▅▅▅ █
  80.9 μs       Histogram: log(frequency) by time        166 μs <

 Memory estimate: 65.95 KiB, allocs estimate: 27.

julia> @benchmark matlab_unique(A)  # 3-output with GroupSlices
BenchmarkTools.Trial: 2633 samples with 1 evaluation.
 Range (min … max):  1.364 ms … 21.726 ms  ┊ GC (min … max):  0.00% … 92.33%
 Time  (median):     1.578 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.890 ms ±  1.643 ms  ┊ GC (mean ± σ):  12.27% ± 13.07%

   █
  ▇█▅▃▂▂▂▂▂▂▂▁▁▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▂▁▂▂▂▁▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▂
  1.36 ms        Histogram: frequency by time        10.8 ms <

 Memory estimate: 1.15 MiB, allocs estimate: 3840.

julia> @benchmark matlab_unique2(A)  # 3-output using sortperm, comparison, and index math, like Octave or old MATLAB
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   85.626 μs … 29.064 ms  ┊ GC (min … max):  0.00% … 99.43%
 Time  (median):     120.276 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   259.169 μs ±  1.356 ms  ┊ GC (mean ± σ):  50.93% ± 10.34%

  █▂                                                           ▁
  ██▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄▁▃▁▃▄▄ █
  85.6 μs       Histogram: log(frequency) by time      7.23 ms <

 Memory estimate: 498.28 KiB, allocs estimate: 21.

julia> @benchmark matlab_unique3(A)  # PooledArray, only returns 1st and 3rd outputs
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   89.099 μs …  30.173 ms  ┊ GC (min … max):  0.00% … 99.31%
 Time  (median):     101.688 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   152.401 μs ± 792.605 μs  ┊ GC (mean ± σ):  27.92% ±  6.43%

  ▄▆▆▇▇█▇▄▃▂▂▂▂▂▃▃▄▃▃▂      ▁▁▁▁                                ▂
  ████████████████████████████████▇▅▇▆▆▇▇█▇█▇▆▆▆▃▅▃▆▆▆▇▆▇▆▄▅▄▃▅ █
  89.1 μs       Histogram: log(frequency) by time        229 μs <

 Memory estimate: 204.91 KiB, allocs estimate: 36.

The PooledArray approach is promising as it matches the output of unique; it would be nice to have a good way to get ia from that. Using indexin would nearly double the runtime, placing it above matlab_unique2.

piever · March 16, 2022, 6:32pm

StructArrays also has some internal machinery that does the sorted approach. For the non-sorted I don’t know much other than PooledArrays, but it seems cheap enough to get ia from ic?

Something like

function find_ia(pool, ic)
    ia = fill(zero(eltype(ic)), length(pool))
    for (i, idx) in enumerate(ic)
        (ia[idx] == 0) && (ia[idx] = i)
    end
    return ia
end

iamed2 · March 23, 2022, 2:02pm

Sorry, meant to reply earlier, but you’re absolutely right. Thank you!

I do need to remember not to @benchmark blocks of code; I almost tried to completely rewrite that to perform the same operation outside of an enumerate loop, but I just needed to do proper function wrapping. For some reason my instincts wrongly told me that wrapping in a scope-creating block had the same effect as wrapping in a function.

That combined with the rest of matlab_unique3 has proven the best. Now to hopefully generalize a bit more and maybe publish something nice to prove that it’s useful.

piever · March 23, 2022, 8:37pm

Yeah, here it’s a bit tricky because somehow the PooledArray constructor is type-unstable

julia> using PooledArrays, Test

julia> @inferred PooledArray(rand(1:3, 100))
ERROR: return type PooledVector{Int64, UInt32, Vector{UInt32}} does not match inferred return type PooledArray{Int64}

so you do need the find_ia function barrier (that had also tricked me initially into believing that find_ia was much more expensive).

iamed2 · March 31, 2022, 6:36pm

Luckily with PooledArray you can just specify your reftype in the constructor and it’s type-stable again, which is trivial for a lot of types, and can rely on UInt64 or even Int as a reasonable fallback.

t-bltg · December 8, 2023, 8:05pm

I’ve posted a pure julia implementation in Unique (indices) method similar to MATLAB - #7 by t-bltg.

Topic		Replies	Views
Unique (indices) method similar to MATLAB General Usage question	7	3493	December 8, 2023
Equal method to matlab unique in rows General Usage question	2	306	November 14, 2022
Is there a function similar to numpy unique with inverse? General Usage question	5	1787	May 12, 2022
Julia version of padarray in matlab New to Julia question	9	2090	April 15, 2020
Is there a Julia equivalent of Matlab's Array(i) = Value for undefined indicies? New to Julia question , matlab , array	15	777	March 15, 2024

MATLAB-like `unique` where you can recreate the original with indices

Related topics