Vector to lists of indices grouped by key

FireCrumb · July 31, 2022, 3:27am

Hi, I have a vector of numbers: (in reality much larger)

arr = Vector{Int64}([6, 9, 9, 4, 1, 1, 2, 7, 8, 3])

I want to be able to generate the list of indices for each value:

> p = pairs(arr)

pairs(::Vector{Int64})(...):
  1  => 6
  2  => 9
  3  => 9
  4  => 4
  5  => 1
  6  => 1
  7  => 2
  8  => 7
  9  => 8
  10 => 3

Desired result:

1 => [5,6]
3 => [10]
4 => [4],
6 => [1]
7 => [2]
8 => [9]
9 => [2,3]

Tried to play with map, reduce, mapreduce, list comprehension, dictionary creation, but to no avail…
Please note that I want to do it in one efficient pass rather than iterating by each possible value in arr and going over array many times.
Thanks!

stillyslalom · July 31, 2022, 3:49am

julia> d = Dict{Int, Vector{Int}}()
Dict{Int64, Vector{Int64}}()

julia> for (k, v) in p
           c = get!(d, v, Int[]) # get array for key, or create if absent
           push!(c, k) # push new index to array
       end

julia> d
Dict{Int64, Vector{Int64}} with 8 entries:
  4 => [4]
  6 => [1]
  7 => [8]
  2 => [7]
  9 => [2, 3]
  8 => [9]
  3 => [10]
  1 => [5, 6]

bkamins · July 31, 2022, 6:04am

Using:

using SplitApplyCombine
group(arr, eachindex(arr))

is an alternative (and it should be faster for large arr)

FireCrumb · July 31, 2022, 7:21am

SplitApplyCombine.jl/src/group.jl

The implementation looks very similar to the solution from above:

function group(groups, values)
    I = eltype(groups) # TODO EltypeUnknown
    T = eltype(values) # TODO EltypeUnknown

    out = Dictionary{I, Vector{T}}()
    @inbounds for (group, value) in zip(groups, values)
        tmp = get!(() -> T[value], out, group)
        last(tmp) == value || push!(tmp, value)
    end

    return out
end

Why do you think it would be faster for large array?

bkamins · July 31, 2022, 8:16am

First note that you have shared the code of an incorrect method for group, but this is a minor issue.

The implementation is similar but not identical and it is faster, here is an example:

julia> using SplitApplyCombine

julia> using BenchmarkTools

julia> arr = rand(1:1000, 10^6);

julia> function sol(arr)
           d = Dict{Int, Vector{Int}}()
           for (k, v) in pairs(arr)
               c = get!(d, v, Int[])
               push!(c, k)
           end
           return d
       end
sol (generic function with 1 method)

julia> @benchmark sol($arr)
BenchmarkTools.Trial: 84 samples with 1 evaluation.
 Range (min … max):  51.139 ms … 79.473 ms  ┊ GC (min … max): 5.99% … 5.12%
 Time  (median):     59.537 ms              ┊ GC (median):    6.03%
 Time  (mean ± σ):   59.924 ms ±  4.510 ms  ┊ GC (mean ± σ):  5.82% ± 1.20%

                         ▁ █
  ▃▁▁▁▃▃▃▃▁▃▆▅▃▆▃▃▃▃▃█▆▅▆█▆██▆▃▅▃▃▅▆▁▃▅▅▅▁▃▃▁▃▁▃▃▁▁▁▁▁▁▁▃▅▁▁▃ ▁
  51.1 ms         Histogram: frequency by time        70.8 ms <

 Memory estimate: 82.41 MiB, allocs estimate: 1005018.

julia> @benchmark group($arr, eachindex($arr))
BenchmarkTools.Trial: 158 samples with 1 evaluation.
 Range (min … max):  23.603 ms … 42.635 ms  ┊ GC (min … max): 0.00% … 6.79%
 Time  (median):     31.774 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.785 ms ±  3.384 ms  ┊ GC (mean ± σ):  4.44% ± 4.96%

               ▂   ▂ ▂▁ ▁█▂▄ ▅▂ ▄▂▄▁▁▂ ▁
  ▃▁▁▁▅▁▃▃▅▃▅▅▃███▃█▃██▆████▅█████████▆█▅▆▃▃▁▁▁▃▃▅▃▁▁▁▁▁▁▁▁▃▅ ▃
  23.6 ms         Histogram: frequency by time        41.5 ms <

 Memory estimate: 21.45 MiB, allocs estimate: 6024.

To fix the problem with sol one needs to make creation of initial Int[] vector lazy and only done when needed:

julia> function sol2(arr)
           d = Dict{Int, Vector{Int}}()
           for (k, v) in pairs(arr)
               c = get!(() -> Int[], d, v)
               push!(c, k)
           end
           return d
       end
sol2 (generic function with 1 method)

julia> @benchmark sol2($arr)
BenchmarkTools.Trial: 188 samples with 1 evaluation.
 Range (min … max):  22.640 ms … 37.099 ms  ┊ GC (min … max): 0.00% … 3.39%
 Time  (median):     26.503 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.695 ms ±  2.338 ms  ┊ GC (mean ± σ):  1.84% ± 1.88%

          ▃▂ ▂ ▇  ▂ ▃▃▃█▂  █
  ▃▃▁▁▄▅█▇██▆█▅█▆▅█▅█████▅██▆▇█▆▅▁▁▃▁▃▄▅▅▁▁▁▁▄▃▃▁▁▄▃▁▁▁▄▁▁▃▃▃ ▃
  22.6 ms         Histogram: frequency by time        33.8 ms <

 Memory estimate: 21.44 MiB, allocs estimate: 6018.

FireCrumb · July 31, 2022, 8:28am

Thanks for the tip!

Also I guess eachindex(arr) is lazy and possibly faster than pairs(arr) which I used, right?

I copied one of the implementations, the others with some more explicit types are similar enough in structure.
Did I miss any important implementation tricks hidden there besides what you mentioned with lazy initialization?

bkamins · July 31, 2022, 8:44am

This should be equivalent - compiler should optimize this out.

No, this is the most important thing. Everything else, should be similar (that is why I have written that the fact that you shared wrong method code does not matter much). In some cases it might matter that SplitApplyCombine.jl uses Dictionary (which is a custom dictionary) and not Dict (which is standard), but in this test this does not affect the performance.

rocco_sprmnt21 · August 1, 2022, 8:51am

a version of get! () that does not use the pair () function.
seems to have performance comparable to the group () function.

julia> @benchmark sol($arr)
BenchmarkTools.Trial: 96 samples with 1 evaluation.
 Range (min … max):  39.003 ms … 107.194 ms  ┊ GC (min … max):  8.26% … 15.59%
 Time  (median):     48.173 ms               ┊ GC (median):    14.18%
 Time  (mean ± σ):   52.484 ms ±  13.080 ms  ┊ GC (mean ± σ):  14.34% ±  3.03%       

      ▂█▃▅  ▂
  ▄▁▄▅████▇▄█▇▆▆▅▄▃▅▄▄▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▃▃▁▁▃▁▁▁▃▁▁▁▁▁▁▁▃▁▃▁▁▁▁▃ ▁
  39 ms           Histogram: frequency by time          102 ms <

 Memory estimate: 82.41 MiB, allocs estimate: 1005018.

julia> @benchmark  group($arr, eachindex($arr))
BenchmarkTools.Trial: 211 samples with 1 evaluation.
 Range (min … max):  18.920 ms … 33.196 ms  ┊ GC (min … max): 0.00% … 18.23%
 Time  (median):     23.611 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.735 ms ±  2.649 ms  ┊ GC (mean ± σ):  6.11% ±  6.49%

  ▃ ▁    ▁▃  ▁▃▃▆     ██▃▄▁██▃▄▆  ▄▁▄ ▃  ▁▃▄ ▄ ▃
  █▆█▇▆▆▆██▆▇████▄▇▄▄▇██████████▇▄███▆█▄▆███▆█▇█▄▇▇▇▇▄▆▄▁▄▁▁▄ ▆
  18.9 ms         Histogram: frequency by time        29.5 ms <

 Memory estimate: 21.45 MiB, allocs estimate: 6024.

julia> @benchmark begin
       d = Dict{Int, Vector{Int}}()
       foreach(((i,e),)->push!(get!(()->[], d,e),i), enumerate(arr))   
       d
       end
BenchmarkTools.Trial: 241 samples with 1 evaluation.
 Range (min … max):  18.828 ms … 24.946 ms  ┊ GC (min … max): 0.00% … 6.36%
 Time  (median):     20.689 ms              ┊ GC (median):    0.00%    
 Time  (mean ± σ):   20.774 ms ±  1.143 ms  ┊ GC (mean ± σ):  3.25% ± 3.29%

      ▂▄█  ▂ ▁▂    ▄▂▂▂▁▅█▂▅▅ ▇ ▁▂ ▂  ▂▁ ▁ ▁     ▁
  ▅▆▆████▁██▆██▆█▅███████████▅█▆████▃█████▅█▃▃▃▅▆█▃▅▃█▃▁▃▃▆▅▃ ▅        
  18.8 ms         Histogram: frequency by time        23.4 ms <        

 Memory estimate: 21.48 MiB, allocs estimate: 7020.

bkamins · August 1, 2022, 9:06am

The crucial change in your code is doing () -> [] part (BTW: it should be () -> Int[])

rocco_sprmnt21 · August 1, 2022, 11:00am

I quickly read the topic and I had not seen sol2(), otherwise I would not have posted mine which is practically the same.