Collect values in a dict

I have a list of pairs of dog-owners and dog-names.
The dog-names are non-unique, and I want to collect a dict keyed on the dog-names where the values are the owners having a dog of that name.

Here’s an example using numbers to make it simple:

Given this:


julia> l = [x => x%3 for x in 1:10]
10-element Vector{Pair{Int64, Int64}}:
  1 => 1
  2 => 2
  3 => 0
  4 => 1
  5 => 2
  6 => 0
  7 => 1
  8 => 2
  9 => 0
 10 => 1

I want this:

julia> d
Dict{Int64, Vector{Int64}} with 3 entries:
  0 => [3, 6, 9]
  2 => [2, 5, 8]
  1 => [1, 4, 7, 10]

Here’s a way to do it, but I find it complicated and hard to read:

l = [x => x%3 for x in 1:10]
d = Dict{Int, Vector{Int}}()

for e in l
    push!(d, e[2] => push!(get(d,e[2],Int[]), e[1]))
end

What is the “idiomatic” way of doing this?

Thanks a lot.

Not sure if it is more “idiomatic”, but I find that get! is a very useful function for things like this:

for (x,y) in l
    v = get!(d, y) do
        Int[]
    end
    push!(v, x)
end
1 Like

Thanks @Per ,

The combination of get! and push! is indeed very nice.

I actually find it easier to read without the temporary v:

l = [x => x%3 for x in 1:10];
d = Dict{Int, Vector{Int}}();
for (x,y) in l
    push!(get!(d,y,Int[]), x)
end
d

Dict{Int64, Vector{Int64}} with 3 entries:
  0 => [3, 6, 9]
  2 => [2, 5, 8]
  1 => [1, 4, 7, 10]

Or even:

l = [x => x%3 for x in 1:10];
d = Dict{Int, Vector{Int}}()
[push!(get!(d,y,Int[]), x) for (x,y) in l];
d

Dict{Int64, Vector{Int64}} with 3 entries:
  0 => [3, 6, 9]
  2 => [2, 5, 8]
  1 => [1, 4, 7, 10]

I’ll wait a bit marking yours as Solution to see if others have alternatives.

I find your initial version quite concise already.
In your real example is l really a list or a dict? I guess it does not matter too much.

I will provide a ‘nastier’ (and probably slower) solution for the existing suggestion to shine :slight_smile:

ldict = Dict(x => x%3 for x in 1:10)

    d = Dict{Int, Vector{Int}}()
    vals = collect(values(ldict))
    ks = collect(keys(ldict))
    for v in unique(vals)
        d[v] = ks[findall(isequal(v),vals)]
    end
    d

This version allocates a new empty Int[] array on every iteration regardless of whether it is actually needed.

If you don’t want to define a variable v, you can still do:

push(get!(d, y) do; Int[]; end, x)

or equivalently push(get!(() -> Int[], d, y), x) (skipping the do syntax).

In addition to allocating an Int[] array on each iteration, this version also allocates an array of the results of the comprehension (a list of push! results), even though you don’t need it. Don’t use comprehensions to write loops if you don’t want to construct an array of results.

I think it’s a bit nicer with a DefaultDict rather than explicit get!:

using DataStructures
d = DefaultDict(()->Int[])
for (k,v) in l
    push!(d[v], k)
end
1 Like

Or (my favorite) push(get!(Vector{Int} , d, y), x)

5 Likes

Thank you all!

For my current application this is not performance critical, but I guess this being Julia, the idiomatic way is the performant one :smile:.

So I benchmarked all of the proposed solutions (on 1000 elements in 20 classes).

The solutions fall into a number of categories:

  • The 3 solutions I proposed are all taking more than 50 µs and 1100 allocations
  • The solutions by @Per , @stevengj , @kristoffer.carlsson using a function to provide defaults, are the most efficient at 25 µs and 128 allocations
  • The solution by @yha using DefaultDict takes about the same time as mine, but half the allocations.
  • The solution by @bernhard using findall takes 92 µs and 188 allocations (and is similar to how I would probably have done in R),

I’ll try to remember the one by @kristoffer.carlsson , which looks “right” to me.

It of course also works in a comprehension, though that does have a bit of a bad smell:

l = [x => x%3 for x in 1:10];
d = Dict{Int, Vector{Int}}()
[ push!(get!(Vector{Int} , d, y), x) for (x,y) in l ];
d

Thanks again, below is the code and benchmark numbers.

Code
using  DataStructures, BenchmarkTools

function f1(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = Dict{Int, Vector{Int}}();
    for e in l
        push!(d, e[2] => push!(get(d,e[2],Int[]), e[1]))
    end
    d
end

function f2(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = Dict{Int, Vector{Int}}();
    for (x,y) in l
         v = get!(d, y) do
             Int[]
         end
        push!(v, x)
    end
    d
end

function f3(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = Dict{Int, Vector{Int}}();
    for (x,y) in l
        push!(get!(d,y,Int[]), x)
    end
    d
end

function f4(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = Dict{Int, Vector{Int}}();
    [push!(get!(d,y,Int[]), x) for (x,y) in l];
    d
end

function f5(;cases=10, classes = 3)
ldict = Dict(x => x % classes for x in 1:cases)

    d = Dict{Int, Vector{Int}}()
    vals = collect(values(ldict))
    ks = collect(keys(ldict))
    for v in unique(vals)
        d[v] = ks[findall(isequal(v),vals)]
    end
    d
end

function f6(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = Dict{Int, Vector{Int}}();
    for (x,y) in l
        push!(get!(() -> Int[], d, y), x)
    end    
    d
end

function f7(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = DefaultDict(()->Int[])
    for (k,v) in l
        push!(d[v], k)
    end
    d
end


function f8(;cases=10, classes = 3)
    l = [x => x % classes for x in 1:cases];
    d = Dict{Int, Vector{Int}}();
    for (x,y) in l
        push!(get!(Vector{Int} , d, y), x)
    end    
    d
end




@info "Original solution"
@btime f1(cases = 1000; classes = 20)
@info "Per"
@btime f2(cases = 1000; classes = 20)
@info "tp2750 v2"
@btime f3(cases = 1000; classes = 20)
@info "tp2750 v3"
@btime f4(cases = 1000; classes = 20)
@info "berhard"
@btime f5(cases = 1000; classes = 20)
@info "stevengj"
@btime f6(cases = 1000; classes = 20)
@info "yha"
@btime f7(cases = 1000; classes = 20)
@info "kristoffer.carlsson"
@btime f8(cases = 1000; classes = 20)

[ Info: Original solution
  55.791 μs (1108 allocations: 117.09 KiB)
[ Info: Per
  24.631 μs (128 allocations: 40.53 KiB)
[ Info: tp2750 v2
  50.627 μs (1108 allocations: 117.09 KiB)
[ Info: tp2750 v3
  52.914 μs (1109 allocations: 125.03 KiB)
[ Info: berhard
  92.287 μs (188 allocations: 88.67 KiB)
[ Info: stevengj
  24.479 μs (128 allocations: 40.53 KiB)
[ Info: yha
  53.809 μs (617 allocations: 48.17 KiB)
[ Info: kristoffer.carlsson
  24.647 μs (128 allocations: 40.53 KiB)

The difference here is due to the DefaultDict being untyped, which is compared to a typed dict in the other benchmarks. If you use d = DefaultDict{Int,Vector{Int}}(()->Int[]) it performs the same as the get!-based solutions on my machine.
Surprisingly, I see that DefaultDict{Int,Vector{Int}}(Vector{Int}) performs worse, which doesn’t happen with get!(Vector{Int}...).

1 Like

Thank you for the clarification @yha

DefaultDict definitely gives the shortest and cleanest solution.