Vector to lists of indices grouped by key

Hi, I have a vector of numbers: (in reality much larger)

arr = Vector{Int64}([6, 9, 9, 4, 1, 1, 2, 7, 8, 3])

I want to be able to generate the list of indices for each value:

> p = pairs(arr)

pairs(::Vector{Int64})(...):
  1  => 6
  2  => 9
  3  => 9
  4  => 4
  5  => 1
  6  => 1
  7  => 2
  8  => 7
  9  => 8
  10 => 3

Desired result:

1 => [5,6]
3 => [10]
4 => [4],
6 => [1]
7 => [2]
8 => [9]
9 => [2,3]

Tried to play with map, reduce, mapreduce, list comprehension, dictionary creation, but to no avail…
Please note that I want to do it in one efficient pass rather than iterating by each possible value in arr and going over array many times.
Thanks!

julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()

julia> for (k, v) in p
           c = get!(d, v, Int[]) # get array for key, or create if absent
           push!(c, k) # push new index to array
       end

julia> d
Dict{Int64, Vector{Int64}} with 8 entries:
  4 => [4]
  6 => [1]
  7 => [8]
  2 => [7]
  9 => [2, 3]
  8 => [9]
  3 => [10]
  1 => [5, 6]
1 Like

Using:

using SplitApplyCombine
group(arr, eachindex(arr))

is an alternative (and it should be faster for large arr)

1 Like

SplitApplyCombine.jl/src/group.jl

The implementation looks very similar to the solution from above:

function group(groups, values)
    I = eltype(groups) # TODO EltypeUnknown
    T = eltype(values) # TODO EltypeUnknown

    out = Dictionary{I, Vector{T}}()
    @inbounds for (group, value) in zip(groups, values)
        tmp = get!(() -> T[value], out, group)
        last(tmp) == value || push!(tmp, value)
    end

    return out
end

Why do you think it would be faster for large array?

First note that you have shared the code of an incorrect method for group, but this is a minor issue.

The implementation is similar but not identical and it is faster, here is an example:

julia> using SplitApplyCombine

julia> using BenchmarkTools

julia> arr = rand(1:1000, 10^6);

julia> function sol(arr)
           d = Dict{Int, Vector{Int}}()
           for (k, v) in pairs(arr)
               c = get!(d, v, Int[])
               push!(c, k)
           end
           return d
       end
sol (generic function with 1 method)

julia> @benchmark sol($arr)
BenchmarkTools.Trial: 84 samples with 1 evaluation.
 Range (min … max):  51.139 ms … 79.473 ms  ┊ GC (min … max): 5.99% … 5.12%
 Time  (median):     59.537 ms              ┊ GC (median):    6.03%
 Time  (mean ± σ):   59.924 ms ±  4.510 ms  ┊ GC (mean ± σ):  5.82% ± 1.20%

                         ▁ █
  ▃▁▁▁▃▃▃▃▁▃▆▅▃▆▃▃▃▃▃█▆▅▆█▆██▆▃▅▃▃▅▆▁▃▅▅▅▁▃▃▁▃▁▃▃▁▁▁▁▁▁▁▃▅▁▁▃ ▁
  51.1 ms         Histogram: frequency by time        70.8 ms <

 Memory estimate: 82.41 MiB, allocs estimate: 1005018.

julia> @benchmark group($arr, eachindex($arr))
BenchmarkTools.Trial: 158 samples with 1 evaluation.
 Range (min … max):  23.603 ms … 42.635 ms  ┊ GC (min … max): 0.00% … 6.79%
 Time  (median):     31.774 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.785 ms ±  3.384 ms  ┊ GC (mean ± σ):  4.44% ± 4.96%

               ▂   ▂ ▂▁ ▁█▂▄ ▅▂ ▄▂▄▁▁▂ ▁
  ▃▁▁▁▅▁▃▃▅▃▅▅▃███▃█▃██▆████▅█████████▆█▅▆▃▃▁▁▁▃▃▅▃▁▁▁▁▁▁▁▁▃▅ ▃
  23.6 ms         Histogram: frequency by time        41.5 ms <

 Memory estimate: 21.45 MiB, allocs estimate: 6024.

To fix the problem with sol one needs to make creation of initial Int[] vector lazy and only done when needed:

julia> function sol2(arr)
           d = Dict{Int, Vector{Int}}()
           for (k, v) in pairs(arr)
               c = get!(() -> Int[], d, v)
               push!(c, k)
           end
           return d
       end
sol2 (generic function with 1 method)

julia> @benchmark sol2($arr)
BenchmarkTools.Trial: 188 samples with 1 evaluation.
 Range (min … max):  22.640 ms … 37.099 ms  ┊ GC (min … max): 0.00% … 3.39%
 Time  (median):     26.503 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.695 ms ±  2.338 ms  ┊ GC (mean ± σ):  1.84% ± 1.88%

          ▃▂ ▂ ▇  ▂ ▃▃▃█▂  █
  ▃▃▁▁▄▅█▇██▆█▅█▆▅█▅█████▅██▆▇█▆▅▁▁▃▁▃▄▅▅▁▁▁▁▄▃▃▁▁▄▃▁▁▁▄▁▁▃▃▃ ▃
  22.6 ms         Histogram: frequency by time        33.8 ms <

 Memory estimate: 21.44 MiB, allocs estimate: 6018.
3 Likes

Thanks for the tip! :+1:

Also I guess eachindex(arr) is lazy and possibly faster than pairs(arr) which I used, right?

I copied one of the implementations, the others with some more explicit types are similar enough in structure.
Did I miss any important implementation tricks hidden there besides what you mentioned with lazy initialization?

This should be equivalent - compiler should optimize this out.

No, this is the most important thing. Everything else, should be similar (that is why I have written that the fact that you shared wrong method code does not matter much). In some cases it might matter that SplitApplyCombine.jl uses Dictionary (which is a custom dictionary) and not Dict (which is standard), but in this test this does not affect the performance.

1 Like

a version of get! () that does not use the pair () function.
seems to have performance comparable to the group () function.

julia> @benchmark sol($arr)
BenchmarkTools.Trial: 96 samples with 1 evaluation.
 Range (min … max):  39.003 ms … 107.194 ms  ┊ GC (min … max):  8.26% … 15.59%
 Time  (median):     48.173 ms               ┊ GC (median):    14.18%
 Time  (mean ± σ):   52.484 ms ±  13.080 ms  ┊ GC (mean ± σ):  14.34% ±  3.03%       

      ▂█▃▅  ▂
  ▄▁▄▅████▇▄█▇▆▆▅▄▃▅▄▄▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▃▃▁▁▃▁▁▁▃▁▁▁▁▁▁▁▃▁▃▁▁▁▁▃ ▁
  39 ms           Histogram: frequency by time          102 ms <

 Memory estimate: 82.41 MiB, allocs estimate: 1005018.

julia> @benchmark  group($arr, eachindex($arr))
BenchmarkTools.Trial: 211 samples with 1 evaluation.
 Range (min … max):  18.920 ms … 33.196 ms  ┊ GC (min … max): 0.00% … 18.23%
 Time  (median):     23.611 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.735 ms ±  2.649 ms  ┊ GC (mean ± σ):  6.11% ±  6.49%

  ▃ ▁    ▁▃  ▁▃▃▆     ██▃▄▁██▃▄▆  ▄▁▄ ▃  ▁▃▄ ▄ ▃
  █▆█▇▆▆▆██▆▇████▄▇▄▄▇██████████▇▄███▆█▄▆███▆█▇█▄▇▇▇▇▄▆▄▁▄▁▁▄ ▆
  18.9 ms         Histogram: frequency by time        29.5 ms <

 Memory estimate: 21.45 MiB, allocs estimate: 6024.

julia> @benchmark begin
       d = Dict{Int, Vector{Int}}()
       foreach(((i,e),)->push!(get!(()->[], d,e),i), enumerate(arr))   
       d
       end
BenchmarkTools.Trial: 241 samples with 1 evaluation.
 Range (min … max):  18.828 ms … 24.946 ms  ┊ GC (min … max): 0.00% … 6.36%
 Time  (median):     20.689 ms              ┊ GC (median):    0.00%    
 Time  (mean ± σ):   20.774 ms ±  1.143 ms  ┊ GC (mean ± σ):  3.25% ± 3.29%

      ▂▄█  ▂ ▁▂    ▄▂▂▂▁▅█▂▅▅ ▇ ▁▂ ▂  ▂▁ ▁ ▁     ▁
  ▅▆▆████▁██▆██▆█▅███████████▅█▆████▃█████▅█▃▃▃▅▆█▃▅▃█▃▁▃▃▆▅▃ ▅        
  18.8 ms         Histogram: frequency by time        23.4 ms <        

 Memory estimate: 21.48 MiB, allocs estimate: 7020.

The crucial change in your code is doing () -> [] part (BTW: it should be () -> Int[])

2 Likes

I quickly read the topic and I had not seen sol2(), otherwise I would not have posted mine which is practically the same.