Stack overflow when merging 50000 dicts

I do wish Base seemed a little more cognizant of the inefficiency of splatting, particularly with regard to vcat and hcat (I guess merge is essentially the equivalent for dicts), though I know this has been discussed elsewhere before. I think the recommended approach at the moment is reduce(vcat, collection) or reduce(merge, collection), as @ScottPJones suggested.


My suggestion was to avoid having the small dicts at all (they really don’t seem to serve any useful purpose) and to simplify the problem to creating the large dictionary from a vector of pairs (which is well supported).

If one creates lots of vectors of pairs, like

[("file1-1", value1)], 
[("file2-1", value1), ("file2-2", value2)]
...

Then we still need to cat them together, and cat is no better than merge…

Is there a reason why you can’t have the values: "filename" => value1 or "filename" => (value1, value2)?
Why would you need to cat them together?
Presumably you know how many files you are processing, so you can create a vector for the results with a sizehint! and push! the results as they are received, or simply allocate it up front and keep a counter (vec[pos += 1] = pair), storing each result as it is received.
Again, that makes creating the final associative array of results simply Dict(vec).
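For example, a minimal sketch of that approach, with random numbers standing in for the real values and a made-up file-naming scheme:

nfiles = 50_000
vec = Vector{Pair{String,UInt}}()
sizehint!(vec, nfiles)          # we know roughly how many results to expect
for i in 1:nfiles
    # stand-in for receiving each file's result
    push!(vec, "file$i-1" => rand(UInt))
end
d = Dict(vec)                   # final associative array, no merging needed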

Okay, more context. Each value will later be sent to a neural network. “filename-id” is the key for each value in the HDF5 file. The values will typically be shuffled, so grouping the values into pairs or tuples is of no use here.

OK. What percentage of the files return 2 values instead of 1?

I didn’t know a priori, but after processing them: about 45k files return a single value, and 12k files return two values.

Sure, my comment was more directed toward the general issue of there being too much stuff in Base that encourages splatting. If splatting is bad, I feel Base should reflect that.


Here is a benchmark. I had just guessed at 20%, which is pretty close to your actual value (12k of 57k ≈ 21%) :grinning:

using BenchmarkTools

const cnt = 50000
const per = 20
const a = Vector{Dict{String,UInt}}()
const b = Vector{Pair{String,UInt}}()

sizehint!(a, cnt)
sizehint!(b, div(cnt * (100 + per), 100))  # expect ~20% extra pairs

for i=1:cnt
    fid = string(rand(UInt))
    flg = rand(0:99) < per
    res = [rand(UInt) for j=1:(1+flg)]
    push!(a, Dict([fid * "-$j" => res[j] for j = 1:(1+flg)]))
    push!(b, fid * "-1" => res[1])
    flg && push!(b, fid * "-2" => res[2])
end

# merge! with splatting: avoids the stack overflow that merge(a...) hits, but still splats
mymerge(a) = merge!(typeof(a[1])(), a...)

# Explicit double loop: copy every key of every dict into a single output Dict
function mymerge1(a)
    dout = typeof(a[1])()
    for d in a
        for k in keys(d)
            dout[k] = d[k]
        end
    end
    return dout
end

# Splatting and merge!
@btime mymerge(a);

# Using generic `reduce` method
@btime reduce(merge,a);

# Naive but efficient explicit-loop method.
@btime ra = mymerge1(a);

# stack overflow, or out of memory, or takes a long time...
# merge(a...)

@btime rb = Dict(b);

The results I got are the following:

  10.277 ms (38 allocations: 6.05 MiB)        # mymerge(a)        (merge! + splatting)
  407.654 ms (235993 allocations: 1.59 GiB)   # reduce(merge, a)
  11.554 ms (36 allocations: 5.67 MiB)        # mymerge1(a)       (explicit loop)
  8.576 ms (36 allocations: 5.67 MiB)         # Dict(b)           (vector of pairs)

showing that the approach I outlined is definitely faster (and doesn’t require any merge/reduce functions to be written).


Splatting isn’t inherently bad or it wouldn’t be a feature of the language in the first place. Splatting to reduce over a collection is what’s bad and should be avoided. How would we prevent that? Not have a + operator and require people to write sum([a, b]) in place of a + b? Well, that would certainly remove the temptation to splat many values to +, but talk about throwing the baby out with the bathwater.

I get what you’re saying and I don’t disagree with you.

What I was getting at is the following: if some random person comes along and wants to concatenate a few thousand arrays, chances are very high that they would use vcat(arrays...). We could have philosophical discussions about how you get yourself into the situation of having to concatenate these arrays in the first place, but I think it’s safe to say this is not an unusual thing for people to do. All I’m saying is that it would be nice if non-splatting concatenation functions both 1. existed and 2. were prominently featured in the documentation.
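For example, with some dummy data, the non-splatting idiom looks like this (just a sketch):

arrays = [rand(10) for _ in 1:1000]   # a thousand small arrays

big_splat  = vcat(arrays...)          # splats 1000 arguments into a single call
big_reduce = reduce(vcat, arrays)     # same result, no splatting

@assert big_splat == big_reduce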

To your point, I can’t think of any other examples where this is a problem off the top of my head other than the various concatenation functions we’ve been discussing here, so it’s not that bad.

I certainly don’t disagree with that.


And look at that, I had even hearted your original comment. So apparently we’ve already had this entire discussion before :laughing:

Not only that. I repeated it in this very discussion. I was thinking of the result of the thread that Stefan already linked to. For the record, in the original discussion it looks like @TotalVerb (on github) introduced the idea, and @StefanKarpinski so to speak “discovered it independently” the following day.

There are a number of issues mixed together in this thread (how many times has that been written?). The main issues (or some of them) are:

  1. What is the OP’s problem and what are the best ways to solve it? The answers would likely apply to a more general class of problems. This is what @ScottPJones has worked on, and maybe is the proper subject for the thread. But the title of this thread refers to issue 3 below.

  2. In some cases, for a large number of arguments, splatting is an inefficient way to do what is really a reduction operation. And in some cases the generic methods for reduce are algorithmically inefficient: the execution time increases faster than linearly in the size of the input vector, although in many cases other factors are more important. The most important functions for which splatting is a problem are probably hcat and vcat. (Now that I think about it, hcat! might be useful.) One solution is to write specialized methods of reduce for reducing with hcat, merge, etc. (see the sketch after this list). This was suggested in the thread linked above. The big advantage is that it does not increase the number of “verbs”, and once you learn this idiom in Julia, you can apply it to other functions. Less code complexity, less to learn.

  3. merge(array...) causes a stack overflow, or allocates all available memory, or the like. This is the sole reason that this thread exists. I also encountered it, and it is the sole reason I became involved. This issue has nothing at all to do with issue number 2 (splatting). It is completely unrelated.
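To make point 2 concrete, a specialized reduce method for merging a vector of dictionaries could look something like this (just a sketch of the idea, not Base’s actual code; it reuses the vector a from my benchmark above):

function Base.reduce(::typeof(merge), dicts::AbstractVector{<:AbstractDict})
    isempty(dicts) && throw(ArgumentError("reducing over an empty collection"))
    out = copy(first(dicts))               # start from the first dict
    for d in Iterators.drop(dicts, 1)
        merge!(out, d)                     # single pass, no pairwise merges
    end
    return out
end

reduce(merge, a)   # no longer hits the slow generic reduction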

The difference between 2 and 3 is the difference between “slow” or “very slow”, and “cannot be done”. @innerlee seems to say that this is sort of one-off code, as was mine. If merge only suffered from issue 2, but not 3, then for these use cases and 10^4 or 10^5 elements, I might have noticed that this step was slow (unlikely based on my benchmarks above). But even so, I was probably in a hurry and would have let it go. Likewise, @innerlee would not have asked a question. In general, splatting a large number of arguments should be discouraged. But whether it is fast or slow depends on what “large” means, and on which cases are included in “general”… maybe it’s worth writing something about this.

By splatting with merge! instead of merge, you are only dealing with issue number 2, not issue number 3. And, as I showed above, the operation can be just as fast as or faster than some reasonable-looking solutions that avoid splatting. As you increase the length of the input vector, I suppose that eventually issue 2 will catch up with you.

You can easily rewrite the method for merge so that merge(a...) is only slow in the sense of number 2. But this renders the result type non-inferrable. promote_type is inferrable for three arguments, but not for four (this holds in v0.6.2, but not in the current v0.7). The code for merge includes methods that call promote_type in such a way as to infer the promoted type of all input arguments. It is recursive, and fails for a large number of arguments.

EDIT: The behavior of promote_type in v0.7 is more complicated than what I stated above.


Hi Stefan, I now have some time to try to understand and investigate the splatting thing. :wink:

See also https://github.com/JuliaLang/julia/pull/26440.
