MATLAB-like `unique` where you can recreate the original with indices

@mcabbott mentioned GroupSlices.jl, which has all the functions necessary to generate the vectors, but is several times slower than `unique.

I managed to get an implementation that only works for vectors that replicates what Octave does, and it’s pretty close to the performance of unique while providing the two extra vectors. Unfortunately, it’s not the “stable” unique that Julia currently has (that preserves order of first appearance), it’s a “sorted” unique.

julia> @benchmark unique(A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.909 μs …  12.138 ms  ┊ GC (min … max):  0.00% … 98.59%
 Time  (median):      86.341 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   104.336 μs ± 301.032 μs  ┊ GC (mean ± σ):  10.56% ±  3.68%

  ▅▆▆█▇▄▂▁▁     ▁▂▃▃▃▂▂▂▂▁▁▁▁                                   ▂
  █████████████████████████████▇▇▇▆▇█▇▇▇▇▇▇▇██▇▇▆▅▆▆▅▅▅▅▆▅▆▅▅▅▅ █
  80.9 μs       Histogram: log(frequency) by time        166 μs <

 Memory estimate: 65.95 KiB, allocs estimate: 27.

julia> @benchmark matlab_unique(A)  # 3-output with GroupSlices
BenchmarkTools.Trial: 2633 samples with 1 evaluation.
 Range (min … max):  1.364 ms … 21.726 ms  ┊ GC (min … max):  0.00% … 92.33%
 Time  (median):     1.578 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.890 ms ±  1.643 ms  ┊ GC (mean ± σ):  12.27% ± 13.07%

   █
  ▇█▅▃▂▂▂▂▂▂▂▁▁▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▂▁▂▂▂▁▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▂
  1.36 ms        Histogram: frequency by time        10.8 ms <

 Memory estimate: 1.15 MiB, allocs estimate: 3840.

julia> @benchmark matlab_unique2(A)  # 3-output using sortperm, comparison, and index math, like Octave or old MATLAB
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   85.626 μs … 29.064 ms  ┊ GC (min … max):  0.00% … 99.43%
 Time  (median):     120.276 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   259.169 μs ±  1.356 ms  ┊ GC (mean ± σ):  50.93% ± 10.34%

  █▂                                                           ▁
  ██▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▄▁▃▁▃▄▄ █
  85.6 μs       Histogram: log(frequency) by time      7.23 ms <

 Memory estimate: 498.28 KiB, allocs estimate: 21.

julia> @benchmark matlab_unique3(A)  # PooledArray, only returns 1st and 3rd outputs
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   89.099 μs …  30.173 ms  ┊ GC (min … max):  0.00% … 99.31%
 Time  (median):     101.688 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   152.401 μs ± 792.605 μs  ┊ GC (mean ± σ):  27.92% ±  6.43%

  ▄▆▆▇▇█▇▄▃▂▂▂▂▂▃▃▄▃▃▂      ▁▁▁▁                                ▂
  ████████████████████████████████▇▅▇▆▆▇▇█▇█▇▆▆▆▃▅▃▆▆▆▇▆▇▆▄▅▄▃▅ █
  89.1 μs       Histogram: log(frequency) by time        229 μs <

 Memory estimate: 204.91 KiB, allocs estimate: 36.

The PooledArray approach is promising as it matches the output of unique; it would be nice to have a good way to get ia from that. Using indexin would nearly double the runtime, placing it above matlab_unique2.

1 Like