Stack overflow when merging 50000 dicts


#1
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> a = []
0-element Array{Any,1}

julia> for i in 1:50000
       push!(a, Dict(rand() => rand()))
       end

julia> merge(a...)
ERROR: StackOverflowError:
Stacktrace:
 [1] promoteK(::Type{T} where T, ::Dict{Float64,Float64}, ::Dict{Float64,Float64}, ::Vararg{Dict{Float64,Float64},N} where N) at ./associative.jl:290 (repeats 16353 times)
 [2] emptymergedict(::Dict{Float64,Float64}, ::Dict{Float64,Float64}, ::Vararg{Dict{Float64,Float64},N} where N) at ./associative.jl:293
 [3] merge(::Dict{Float64,Float64}, ::Dict{Float64,Float64},::Dict{Float64,Float64}, ::Vararg{Dict{Float64,Float64},N} where N) at ./associative.jl:256

I want to ask what is the prefered way to merge a list of dicts? Thanks in advance!


#2

Okay, found a solution: merge!(Dict(), a...) is very fast…


Way to construct multi-key ImmutableDict
#3

Try

reduce(merge, a)

#4

reduce can do the job, good to know, thanks!


#5

merge! is in general much faster than reduce for this purpose. The later will compute the narrowest output type and allocate a new dict as each input dict is merged. The (potential) disadvantage of merge! is that you have to choose the target key and value types yourself, and it will throw an error if the keys or values encountered are incompatible.

I have submitted a PR to make reduce more efficient for merge. But, there are several complicating issues, and it is not yet ready to be merged. Without recounting the whole story, I can say that I favor merge! if you know (or don’t care) about the type of the output dict. Reasons are 1. it is fairly compact. 2. It is faster than any optimized reduce. 3. Reading the expression gives more information on the ouput type. 4. An error will be thrown if your assumptions about the input dicts are violated. On the other hand, in end-user code, it may not make much difference which one you choose.

Note that @btime shows that merge!(typeof(a[1])(), a...) is twice as fast as merge!(Dict(), a...) in your example. The faster one returns a dict of type Dict{Float64,Float64}, which may or may not be what you want.


#6

Thanks for the detailed info. I can undersand most of the points. Hope a better version of reduce will land.

For my use case, the script only need to run once (for processing a dataset and then save the result). So waiting for 5s or 10s does not matters much, as long as it can do the job in a reasonable time :stuck_out_tongue:


#7

You are right, but I am wondering if merging N of M-element dictionaries is a use case to optimize for, if N \gg M. I don’t know what the context for this question is, but I would give

  1. simply collecting vectors of Pairs,
  2. taking the key and value types from the element type of the resulting collection,
  3. starting an empty Dict{K,V}(),
  4. then just merging the pairs into it elementwise with setindex!

a try to see if it is faster.


#8

You should never splat large collections of things. You are literally doing 50000-way multiple dispatch.


#9
onedict(n=10) = Dict(rand(1:n) => rand(1:n) for i in 1:n)
manydicts(m) = [onedict() for i in 1:m]
mymerge(a) = merge!(typeof(a[1])(),a...)

function mymerge1(a)
    dout = typeof(a[1])()
    for d in a
        for k in keys(d)
            dout[k] = d[k]
        end
    end
    return dout
end

# Array of 50 thousand dicts
const a = manydicts(50*10^3);

# Splatting and merge!
julia> @btime mymerge(a);
  7.398 ms (6 allocations: 391.36 KiB)

# Using generic `reduce` method
julia> @btime reduce(merge,a);
  15.345 ms (199996 allocations: 28.99 MiB)

# Naive efficent method.
 julia> @btime mymerge1(a);
  8.516 ms (4 allocations: 608 bytes)

# stack overflow, or out of memory, or takes a long time...
# merge(a...)

Splatting is fastest in this case. This is roughly the use case that led me to the problem. But, I found that the best method depends on the number of items in the dicts, the typical number of different items in the dicts, the type of the dicts, and the type of container. I didn’t try all combinations, but the execution time of methods that are not terrible (ie merge(a...)) tends to vary by a factor of two. One could try to write logic to distinguish among the cases. But, this adds code complexity. And my feeling was that some of the inefficiencies (say due to container type) will be improved in the future, so that the logic would have to be changed.

The inefficiency in merge(a...) is due not to splatting a large array, but rather to recursion in determining the output type. The output type is inferred. You can make merge(a...) efficient at the cost of giving up inference. @nalimilan objected to this, which is reasonable.

@Tamas_Papp Yes, that might be efficient in some cases. But, I guess it would be contingent on many factors.

EDIT: the efficiency is not always within a factor of two. In this case the generic method for reduce is not performant.


#10

I’m curious, are these Dicts with a lot of duplicate keys, or few if any?
That can change the best way of handling merging them.
Also, what do you expect when there are duplicates - what if the values are dictionaries themselves, do you merge them recursively (i.e. as you’d want to do if the dict was representing a multidimensional associative array)?
Stefan is right about not using splatting for this!


#11

I have recieved the same warning several times, but the splat gramma is so cute… (and I dont understand why 50000-way multiple dispatch implies slowness.) Is it possible just to make splat fast?


#12

The context is, there are more than 60000 documents, from each one, 1 or 2 (depends on the document) arrays might be extracted. Extraction is time-consuming, so they are processed independently and paralleled via pmap. pmap returns an array of Dicts. The key of each dict is "$filename-$id", where id=1 or 2. So in the end, I would like to merge those dicts into a large dict.


#13

It looks like you are saying: You have an array of 60000 dicts. Each dict has exactly one key, which has the form "$filename-$id" . The filenames are unique, so the keys are as well. Is the information you want encoded in the key, and you don’t care about the value ? Your example code seems to be close to your real use case. Using the same functions I defined above:

Julia v0.6.2

julia> using BenchmarkTools
julia> const a = [Dict(rand() => rand()) for i in 1:50000];
julia> @btime mymerge(a);
  7.679 ms (38 allocations: 6.05 MiB)
julia> @btime mymerge1(a);
  7.341 ms (36 allocations: 5.67 MiB)
julia> @btime reduce(merge,a);
  1.214 s (654008 allocations: 1.56 GiB)
julia> @btime reduce(merge!, $(copy(a[1])), a);
  7.741 ms (0 allocations: 0 bytes

Julia v0.7.0-DEV.4810

julia> @btime mymerge(a);
  5.698 ms (38 allocations: 6.05 MiB)
julia> @btime mymerge1(a);
  5.568 ms (36 allocations: 5.67 MiB)
julia> @btime reduce(merge,a);
  185.915 ms (226659 allocations: 1.34 GiB)
julia> @btime reduce(merge!, $(copy(a[1])), a);
  5.457 ms (0 allocations: 0 bytes)

Nice to see that v0.7 is uniformly faster! In particular, the biggest gain is in the most inefficient method reduce(merge,a). Since it’s a one-off script, you can choose any of the remaining methods. Notice that splatting, in itself, is not slow for this task.

Here, splatting is slower and does more allocation (v0.7):

julia> const b = collect(1:10^5);
julia> @btime sum(b);
  19.561 μs (0 allocations: 0 bytes)
julia> f(a...) = sum(a);
julia> @btime f(b);
  85.248 μs (2 allocations: 781.33 KiB)

#14

Most of the dicts have only one key, and about 1/5 of the dicts have two keys. This heterogeneity is the reason why I choose Dict as storage.


#15

Ok, I figured you chose Dict for some good reason. It seems that the current plan to handle this kind of thing will be reduce(merge,a). By “this kind of thing”, I mean a function like merge that may take a variable (and potentially large) number of arguments. The idea is not to find the myriad best solutions for various use cases, but rather to write a specialized version of reduce(merge,a) that doesn’t perform terribly for any of a large number of use cases. This will hopefully be done for v0.7. For v0.6.2, you now have seen some other choices.


#16

Why return those as a dict at all then?
You could simply return a tuple, ("filename", value1) or ("filename", (value1, value2)), or a Pair,
"filename" => value1 "filename" => (value1, value2), which would be much much more efficient than creating a Dict (which have quite a bit of overhead).
You put the results all into a vector, and simply call Dict(array) to create the large dict.


#17

Oh, forgot to mention a detail: in the end, all values will be gathered. Two values from a same file will be treated as two independent values. So, the grand dict is like

Dict(
    "file1-1" => value1,
    "file2-1" => value1,
    "file2-2" => value2,
    ......
)

#18

You’re welcome to give it a try.


#19

Instead, why couldn’t have the final structure be as follows (or use a vector for the values, but that takes a lot more space):

Dict(
    "file1" => value1,
    "file2" => (value1, value2),
    ......
)

#20

Okay, will try it perhaps at next month :male_detective: