I do wish `Base` seemed a little more cognizant of the inefficiency of splatting, particularly in regard to `vcat` and `hcat` (I guess `merge` is essentially the equivalent for dicts), though I know this has been discussed elsewhere before. I think that the recommended method at the moment is `reduce(vcat, collection)` or `reduce(merge, collection)`, as I think @ScottPJones suggested.
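For concreteness, here is a small, hedged illustration of that idiom (the `vecs` and `dicts` names are just placeholders); note that the generic `reduce` fallback can still be slow for `merge`, as the benchmarks later in this thread show:

```julia
vecs  = [rand(3) for _ in 1:1000]
dicts = [Dict("k$i" => i) for i in 1:1000]

big_vec  = reduce(vcat, vecs)    # instead of vcat(vecs...)
big_dict = reduce(merge, dicts)  # instead of merge(dicts...)
```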
My suggestion was to avoid having the small dicts anyway (they really don't seem to serve any useful purpose), and simplify the problem to simply creating the large dictionary out of a vector of pairs (which is well supported).

If it creates lots of vectors of pairs, like

[("file1-1", value1)],
[("file2-1", value1), ("file2-2", value2)]
...

then we still need to cat them together, and cat is no better than merge…

Is there a reason why you can't have the values be `"filename" => value1` or `"filename" => (value1, value2)`?
Why would you need to cat them together? Presumably you know how many files you are processing, so you can create a vector for the results with a `sizehint!` and `push!` the results as they are received, or simply allocate it and keep a counter, `vec[pos += 1] = pair`, storing each result as it is received. Again, that makes creating the final associative array of results simply `Dict(vec)`.
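A minimal sketch of that approach, with made-up file names and values standing in for the real per-file results:

```julia
filenames = ["file$i" for i in 1:1000]   # placeholder file list
results = Vector{Pair{String,UInt}}()
sizehint!(results, length(filenames))    # reserve space up front
for fname in filenames
    value = rand(UInt)                   # stand-in for the processed result
    push!(results, fname * "-1" => value)
end
final = Dict(results)                    # build the big Dict in one go
```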
Okay, more context. Each value will later be sent to a neural network. "filename-id" is the key for each value in the HDF5 file. Values typically will be shuffled. Pair info is of no use here.
OK. What percentage of the files return 2 values instead of 1?
Didn't know a priori, but after processing, there are 45k files that return a single value and 12k files that return two values.
Sure, my comment was more directed toward the general issue of there being too much stuff in `Base` that encourages splatting. If splatting is bad, I feel `Base` should reflect that.
Here is a benchmark; I had just guessed at 20%, which is pretty close to your actual value (21%):
```julia
using BenchmarkTools

const cnt = 50000
const per = 20

const a = Vector{Dict{String,UInt}}()
const b = Vector{Pair{String,UInt}}()
sizehint!(a, cnt)
sizehint!(b, div(cnt * (100 + per), 100))  # ~20% of files yield a second pair

for i = 1:cnt
    fid = string(rand(UInt))
    flg = rand(0:99) < per
    res = [rand(UInt) for j = 1:(1 + flg)]
    push!(a, Dict([fid * "-$j" => res[j] for j = 1:(1 + flg)]))
    push!(b, fid * "-1" => res[1])
    flg && push!(b, fid * "-2" => res[2])
end

mymerge(a) = merge!(typeof(a[1])(), a...)

function mymerge1(a)
    dout = typeof(a[1])()
    for d in a
        for k in keys(d)
            dout[k] = d[k]
        end
    end
    return dout
end

# Splatting and merge!
@btime mymerge(a);
# Using generic `reduce` method
@btime reduce(merge, a);
# Naive efficient method
@btime ra = mymerge1(a);
# stack overflow, or out of memory, or takes a long time...
# merge(a...)
@btime rb = Dict(b);
```
The results I got are the following:
```
 10.277 ms (38 allocations: 6.05 MiB)
407.654 ms (235993 allocations: 1.59 GiB)
 11.554 ms (36 allocations: 5.67 MiB)
  8.576 ms (36 allocations: 5.67 MiB)
```
showing that the approach I outlined is definitely faster (and doesn't require any merge/reduce functions to be written).
Splatting isn't inherently bad, or it wouldn't be a feature of the language in the first place. Splatting to reduce over a collection is what's bad and should be avoided. How would we prevent that? Not have a `+` operator and require people to write `sum([a, b])` in place of `a + b`? Well, that would certainly remove the temptation to splat many values to `+`, but talk about throwing the baby out with the bath water.
I get what you're saying and I don't disagree with you.

What I was getting at is the following: if some random person comes along and wants to concatenate a few thousand arrays, chances are very high that they would use `vcat(arrays...)`. We could have philosophical discussions about how you get yourself into the situation of having to concatenate these arrays in the first place, but I think it's safe to say this is not an unusual thing for people to do. All I'm saying is that it would be nice if there were some sort of non-splatting concatenation functions which both 1. exist and 2. are prominently featured in the documentation.

To your point, I can't think of any other examples off the top of my head where this is a problem, other than the various concatenation functions we've been discussing here, so it's not that bad.
And look at that, I had even hearted your original comment. So apparently we've already had this entire discussion before.
Not only that. I repeated it in this very discussion. I was thinking of the result of the thread that Stefan already linked to. For the record, in the original discussion it looks like @TotalVerb (on GitHub) introduced the idea, and @StefanKarpinski so to speak "discovered it independently" the following day.
There are a number of issues mixed together in this thread (how many times has that been written?). The main issues (or some of them) are:

1. What is the OP's problem, and what are the best ways to solve it? The answers would likely apply to a more general class of problems. This is what @ScottPJones has worked on, and maybe it is the proper subject for the thread. But the title of this thread refers to issue 3 below.

2. In some cases, for a large number of arguments, splatting is an inefficient way to do what really is a reduction operation. But in some cases the generic methods for `reduce` are algorithmically inefficient: the execution time increases faster than linearly in the size of the input vector, although in many cases other factors are more important. The most important functions for which splatting is a problem are probably `hcat` and `vcat`. (Now that I think about it, `hcat!` might be useful.) One solution is to write particular methods for `reduce` reducing with `hcat`, or `merge`, etc.; see the sketch just after this list. This was suggested in the thread linked above. The big advantage is that it does not increase the number of "verbs". And once you learn this idiom in Julia, you can apply it to other functions. Less code complexity. Less to learn.

3. `merge(array...)` causes a stack overflow, or allocates all memory, etc. This is the sole reason that this thread exists. I also encountered it, and it is the sole reason I became involved. This issue has nothing at all to do with issue number 2 (splatting). It is completely unrelated.
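As a hedged illustration of the "particular methods for `reduce`" idea in item 2, here is the shape such a specialization could take for `merge` (this is a sketch, not Base's code; `merge_all` is a made-up name, and the input is assumed to be a non-empty vector of dicts of the same type):

```julia
# Fill a single output dict in place: no splatting, and no repeated
# copying of intermediate dicts as in the generic pairwise reduction.
function merge_all(dicts::AbstractVector{<:AbstractDict})
    out = typeof(first(dicts))()        # empty dict of the same type
    sizehint!(out, sum(length, dicts))  # upper bound on the final size
    for d in dicts
        merge!(out, d)
    end
    return out
end
```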
The difference between 2 and 3 is the difference between "slow" or "very slow", and "cannot be done". @innerlee seems to say that this is sort of one-off code, as was mine. If `merge` only suffered from issue 2, but not 3, then for these use cases and 10^4 or 10^5 elements, I might have noticed that this step was slow (unlikely based on my benchmarks above). But even so, I was probably in a hurry and would have let it go. Likewise, @innerlee would not have asked a question. In general, splatting a large number of arguments should be discouraged. But whether it is fast or slow depends on what "large" means, and on which cases are included in "general"… maybe it's worth writing something about this.
By splatting with `merge!` instead of `merge`, you are only dealing with issue number 2, but not issue number 3. And, as I showed above, the operation can be just as fast as, or faster than, some reasonable-looking solutions that avoid splatting. As you increase the length of the input vector, I suppose that eventually issue 2 will catch you.
You can easily rewrite the method for `merge` so that `merge(a...)` is only slow in the sense of number 2. But this renders it non-inferrable. `promote_type` is inferrable for three arguments, but not four (in v0.6.2, but not current v0.7). But the code for `merge` includes methods that call `promote_type` in such a way as to infer the promoted type of all input arguments. It is recursive, and fails for a large number of arguments.
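A hedged sketch of what such a rewrite could look like (`loopmerge` is a made-up name, not Base's actual `merge` method): the promoted key and value types are computed with an explicit loop instead of recursion, so a large number of arguments no longer blows the stack, though the call is still splatted and therefore still subject to issue 2:

```julia
function loopmerge(d::AbstractDict, others::AbstractDict...)
    # Fold promote_type over the arguments iteratively rather than recursively.
    K, V = keytype(d), valtype(d)
    for o in others
        K = promote_type(K, keytype(o))
        V = promote_type(V, valtype(o))
    end
    return merge!(Dict{K,V}(), d, others...)
end
```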
EDIT: The behavior of `promote_type` in v0.7 is more complicated than what I stated above.
Hi Stefan, I now have some time to try to understand & investigate the splatting thing.