Slow `reduce(vcat, itr)`

reduce(vcat, itr) and reduce(hcat, itr) have special implementations that are supposed to make them faster than vcat(itr...) and hcat(itr...), at least for long iterators. But I am seeing the opposite:

It starts out ok:

julia> A = rand(10, 10);

julia> @btime vcat(eachcol($A)...);
  1.700 μs (34 allocations: 3.34 KiB)

julia> @btime reduce(vcat, eachcol($A));
  1.200 μs (34 allocations: 6.34 KiB)

But then:

julia> A = rand(10, 100);

julia> @btime vcat(eachcol($A)...);
  18.300 μs (314 allocations: 44.42 KiB)

julia> @btime reduce(vcat, eachcol($A));
  38.300 μs (394 allocations: 425.31 KiB)

Now it gets quite a bit worse:

julia> A = rand(10, 1000);

julia> @btime vcat(eachcol($A)...);
  181.700 μs (3512 allocations: 476.06 KiB)

julia> @btime reduce(vcat, eachcol($A));
  3.029 ms (4790 allocations: 38.44 MiB)

And now it explodes; the time complexity seems worse than quadratic:

julia> A = rand(10, 10000);

julia> @btime vcat(eachcol($A)...);
  1.771 ms (39516 allocations: 3.89 MiB)

julia> @btime reduce(vcat, eachcol($A));
  1.602 s (49790 allocations: 3.73 GiB)

The situation is similar for hcat.

Furthermore, why is the splatting version not slowing down? Splatting containers of dynamic size is supposed to be slow, but that’s not apparent here at all.

What’s more, it is even faster to first collect the columns in an array comprehension, which is then splatted into vcat:

julia> A = rand(10, 10000);

julia> @btime vcat(eachcol($A)...);
  1.865 ms (39516 allocations: 3.89 MiB)

julia> @btime vcat([a for a in eachcol($A)]...);
  1.469 ms (20016 allocations: 3.36 MiB)

This is making me very unhappy, as it really messes with my understanding of how to achieve good performance. My mental model is that eachcol -> array comprehension -> splat -> vcat should be the slowest, since it does extra unnecessary work and allocates an intermediate array; eachcol -> splat -> vcat should be second; and reduce -> vcat should be the fastest.

Not sure it’s exactly the same but I’m pretty sure there’s a GitHub issue by @oxinabox somewhere about reduce(vcat,..) being a performance footgun in certain situations.

I actually saw that (Optimized reduce(vcat, xs) and reduce(hcat, xs) for xs being any iterator · Issue #31636 · JuliaLang/julia · GitHub), but misread it as an old issue requesting an optimized implementation of reduce(vcat), which I presumed had since been added.

But what is reduce(vcat, itr) good for, then? And why does splatting not slow down for long itr, as in my mind it should?

And why is it even faster to first collect in a comprehension and then splat?

I noticed this using JET on v1.8.1:

julia> @report_opt vcat(eachcol(A)...)
No errors detected


julia> @report_opt reduce(vcat, eachcol(A))
═════ 99 possible errors found ═════
┌ @ reduce.jl:483 Base.:(var"#reduce#265")(pairs(NamedTuple()), #self#, op, itr)
│┌ @ reduce.jl:483 mapreduce(identity, op, itr)
││┌ @ reduce.jl:302 Base.:(var"#mapreduce#263")(pairs(NamedTuple()), #self#, f, op, itr)
│││┌ @ reduce.jl:302 mapfoldl(f, op, itr)
││││┌ @ reduce.jl:170 Base.:(var"#mapfoldl#259")(Base._InitialValue(), #self#, f, op, itr)
│││││┌ @ reduce.jl:170 Base.mapfoldl_impl(f, op, init, itr)
││││││┌ @ reduce.jl:44 Base.foldl_impl(op′, nt, itr′)
│││││││┌ @ reduce.jl:48 v = Base._foldl_impl(op, nt, itr)
││││││││┌ @ reduce.jl:62 v = op(v, y[1])
│││││││││┌ @ reduce.jl:95 op.rf(acc, op.f(x))
││││││││││┌ @ reduce.jl:81 op.rf(acc, x)
│││││││││││┌ @ abstractarray.jl:1564 Base.typed_vcat(tuple(T), V...)
││││││││││││┌ @ abstractarray.jl:1648 Base._typed_vcat(T, A)
│││││││││││││┌ @ abstractarray.jl:1572 sum(map(length, V))
││││││││││││││┌ @ reduce.jl:557 Base.:(var"#sum#267")(pairs(NamedTuple()), #self#, a)
│││││││││││││││┌ @ reduce.jl:557 sum(identity, a)
││││││││││││││││┌ @ reduce.jl:528 Base.:(var"#sum#266")(pairs(NamedTuple()), #self#, f, a)
│││││││││││││││││┌ @ reduce.jl:528 mapreduce(f, Base.add_sum, a)
││││││││││││││││││┌ @ reduce.jl:302 Base.:(var"#mapreduce#263")(pairs(NamedTuple()), #self#, f, op, itr)
│││││││││││││││││││┌ @ reduce.jl:302 merge(Base.NamedTuple(), kw)
││││││││││││││││││││┌ @ namedtuple.jl:313 Core.apply_type(Base.NamedTuple, tuple(names...))(tuple(vals...))
│││││││││││││││││││││┌ @ boot.jl:603 Core.apply_type(NamedTuple, names, typeof(args))(args)
││││││││││││││││││││││┌ @ namedtuple.jl:96 T(args)
│││││││││││││││││││││││┌ @ tuple.jl:312 convert(T, x)
││││││││││││││││││││││││┌ @ essentials.jl:339 Val(N)
│││││││││││││││││││││││││┌ @ essentials.jl:714 %1()
││││││││││││││││││││││││││ runtime dispatch detected: %1::Type{Val{_A}} where _A()::Val
│││││││││││││││││││││││││└─────────────────────
[..output trimmed as it's too long...]

Not to answer your question about reduce directly, but to give a clean and performant solution:

julia> A = rand(10, 10000);

julia> @btime vcat(eachcol($A)...);
  1.548 ms (39516 allocations: 3.89 MiB)

julia> @btime reduce(vcat, eachcol($A));
  391.733 ms (49790 allocations: 3.73 GiB)

julia> using SplitApplyCombine

# twice as fast as `vcat(...)`
julia> @btime flatten(eachcol($A));
  726.799 μs (100011 allocations: 3.50 MiB)

# a further 3x faster, with far fewer allocations
julia> @btime flatten(splitdimsview($A));
  260.728 μs (11 allocations: 1.98 MiB)

# for some reason, a further 2x faster; maybe flatten can be improved...
julia> @btime reduce(vcat, splitdimsview($A));
  117.177 μs (4 allocations: 859.47 KiB)

flatten is like Iterators.flatten in Base, but array-focused.
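A quick illustration (with SplitApplyCombine loaded as above; small input so the output is readable):

julia> flatten([[1, 2], [3, 4]])
4-element Vector{Int64}:
 1
 2
 3
 4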


Does splitdimsview more or less replicate the functionality of eachcol/eachrow/eachslice? If so, it seems like it should be possible to recover performance for reduce(vcat, eachcol(A)).

Arrays of arrays, I believe (as opposed to general iterators).


Note that for this particular case you can do vec(A), which requires no copies at all.
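Since vcat of the columns just reproduces the column-major layout, this is easy to check:

julia> A = rand(10, 100);

julia> vcat(eachcol(A)...) == vec(A)   # vec is a zero-copy reshape
true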

Indeed, but there are more operations involved which I excised in order to focus squarely on the concatenation part.

The eachslice family has been pretty slow in Julia for a long time; I think future versions (from 1.9 on) will make it noticeably faster. splitdims/splitdimsview is indeed an implementation of the same concept, but (a) it is an AbstractArray, (b) it is actually performant, and (c) it works on all relevant Julia versions.
Btw, there are also combinedims/combinedimsview in the same package; the former has basically the same interface as the stack() that will appear in Julia 1.9.

I believe @andyferris (author of SplitApplyCombine.jl) said that a goal of this package is to evaluate the usefulness, interface, and performance of different functions for working with data, before their potential inclusion in Base. This plan seems to be working out just fine (:


Looking at methods(reduce), there are no specializations for hcat/vcat on v1.8.0 except on AbstractVector{<:AbstractVecOrMat}. So it would seem that you’re hitting the slow path with the default reduce behavior.
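For contrast, materializing the columns into a Vector matches that signature and takes the fast path:

julia> vs = collect(eachcol(rand(10, 1000)));

julia> vs isa AbstractVector{<:AbstractVecOrMat}   # matches the specialization
true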
Notice that even tuples hit the slow path (of course, they could be performantly splatted into hcat):

julia> using BenchmarkTools

julia> nt = ntuple(Returns(collect(1:3)), 10)
([1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3])

julia> @btime hcat($nt...)
  61.060 ns (1 allocation: 304 bytes)
3×10 Matrix{Int64}:
 1  1  1  1  1  1  1  1  1  1
 2  2  2  2  2  2  2  2  2  2
 3  3  3  3  3  3  3  3  3  3

julia> @btime reduce(hcat,$nt)
  332.883 ns (9 allocations: 1.80 KiB)
3×10 Matrix{Int64}:
 1  1  1  1  1  1  1  1  1  1
 2  2  2  2  2  2  2  2  2  2
 3  3  3  3  3  3  3  3  3  3

EDIT: Thanks to those below for pointing out a mistake. stack does, indeed, support a dims argument. I have removed the offending text, since strikethrough does not appear to be an option.


Are you sure? It says in the linked PR that stack takes a dims keyword.


To me flatten seems most consistent: it’s exactly like Iterators.flatten, but for arrays.

Yes. This appears not to be documented anywhere else, which #46429 would like to fix.

Yes, it does. Note that dims=1 is the opposite of eachrow, rather than being like vcat:

julia> A = [1 2 3; 4 5 6.0];

julia> vcat(eachrow(A)...)
6-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0
 5.0
 6.0

julia> stack(eachrow(A); dims=1)
2×3 Matrix{Float64}:
 1.0  2.0  3.0
 4.0  5.0  6.0

julia> hcat(eachrow(A)...) == stack(eachrow(A))
true

Note that you can use stack from Compat.jl, without waiting for 1.9, just using Compat.
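That is (assuming a reasonably recent Compat version):

julia> using Compat   # provides `stack` on Julia < 1.9

julia> size(stack(eachcol(rand(10, 5))))
(10, 5)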

This would be the obvious companion to stack for things more like reduce(vcat, xs). IIRC there is no issue, but it’s come up a few times elsewhere?


Having put some more thought into this, I’m not sure if it is actually possible to do much better for generic iterators.

In most functions we guarantee that an iterator will be iterated only once. This is important for the correctness of stateful iterators and for the performance of iterators whose values take a long time to compute.

In order to perform an “efficient” concatenation of an iterator, we need to know the size of each value it produces so that the result can be preallocated. This can be done efficiently when the values are arrays or subarrays that we can simply inspect, but in other cases the values may be so expensive to produce that the cost of computing them dwarfs the cost of concatenation-without-preallocation. In those cases it would be inefficient to iterate through once merely to compute the sizes for preallocation and then iterate again. And, to reiterate, this would be outright wrong for stateful iterators.
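A small illustration of why a two-pass approach would be wrong for a stateful iterator:

julia> s = Iterators.Stateful(eachcol(rand(2, 3)));

julia> length(collect(s))   # a sizing pass would consume everything
3

julia> length(collect(s))   # the actual concatenation pass would see nothing
0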

It would be nice if we could catch cases where the sizes (or values) can be produced cheaply and safely so that we can iterate through to preallocate the output, but we can’t just do this in general without risking a 2x performance loss (for expensive iteration) and correctness bugs (for stateful iteration).
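Base’s iterator-size traits could in principle detect some of the cheap cases, though note they only reveal how many values there are, not how large each one is:

julia> Base.IteratorSize(eachcol(rand(2, 3)))   # number of slices known up front
Base.HasShape{1}()

julia> Base.IteratorSize(c for c in eachcol(rand(2, 3)) if sum(c) > 0)
Base.SizeUnknown()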

It seems that an appropriate fallback for generic iterators would be reduce(hcat, itr) = reduce(hcat, collect(itr)) (and likewise for vcat). This ensures the iterator is iterated only once yet still allows preallocation, albeit at the cost of the intermediate collect.

julia> A = rand(10, 10000);

julia> @btime reduce(vcat, eachcol($A));
  340.489 ms (49790 allocations: 3.73 GiB)

julia> @btime reduce(vcat, collect(eachcol($A)));
  141.000 μs (6 allocations: 1.22 MiB)

It’d be better to cut cases away from the fallback. AbstractVectors are already handled, but other containers like Tuple and AbstractArray (and AbstractDict, as a weird but technically legal input to a concatenation) would be candidates for a non-collecting variant.

But I think that continuing to cram this behavior under a specialization of reduce is probably a mistake and this should be handled via a dedicated function (either v1.9’s stack or some new concatenation function).

BTW, stack’s approach is that it collects if the length of the iterator is unknown, or if it cannot infer the eltype of the slices. But this is an easier problem than vcat, as all slices must be the same size. It always allocates the whole array up front.
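For instance (on Julia 1.9+, or with Compat as noted above), stack rejects ragged input that reduce(vcat, ...) happily accepts:

julia> stack([[1, 2], [3]])   # slices differ in size
ERROR: DimensionMismatch [...]

julia> reduce(vcat, [[1, 2], [3]])
3-element Vector{Int64}:
 1
 2
 3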

Making reduce(vcat, ...) and friends collect iterators, and accept higher-dimensional arrays and tuples, might be worth doing. But it is tricky to know when this is going to be fast.

There’s a prototype here for an eager flatten function. But maybe someone should make an issue to see if it’s likely to be accepted, first.

Preallocation of the correct size is not the reason why reduce(vcat, eachcol(A)) is that slow. A very simple “create an empty array and append! in a for loop” approach is orders of magnitude faster, almost as fast as reduce(vcat, collect(...)). That’s how flatten has been implemented in https://github.com/JuliaData/SplitApplyCombine.jl/blob/main/src/map.jl for a long time; see my comment above for benchmarks.
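For reference, a minimal sketch of that strategy (a hypothetical appendcat helper, not the actual SplitApplyCombine.jl code):

# Sketch of the “empty array + append! in a loop” strategy.
function appendcat(itr)
    y = iterate(itr)
    y === nothing && throw(ArgumentError("empty iterator"))
    v, state = y
    out = collect(v)                  # start from a copy of the first element
    while (y = iterate(itr, state)) !== nothing
        v, state = y
        append!(out, v)               # amortized O(1) growth, single pass
    end
    return out
end

julia> A = rand(10, 1000);

julia> appendcat(eachcol(A)) == vec(A)
true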


My point in all of this is that reduce(vcat,...) is problematic because it’s the best way to do some things but a horrifying performance trap in others. Further, it’s a tad complicated to know in which cases it is good and in which it is bad. Relying on specific type signatures to point you down a narrow performance fast-path – with performance deathtraps on all sides – is fragile design.

I would prefer to destroy the idiom and introduce a function whose sole job is to concatenate arrays, leaving the existing reduce specializations as legacy support but deprecated and rewritten to simply forward to the new function. reduce(vcat, ...) is specifically documented to operate pairwise on elements (although an outstanding PR will document the vcat/hcat specializations), which is exactly the wrong way to concatenate arrays and why it is so unintuitive. In that regard, these specializations are semantically correct but algorithmically misleading.
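To make that concrete: taken literally, the documented pairwise semantics amount to something like the following sketch, which re-copies the whole accumulator on every step and is therefore quadratic in the total number of elements:

# What a literal pairwise fold does (a sketch of the generic fallback’s
# behavior, not its actual source):
A = rand(10, 1000)
acc = Float64[]
for col in eachcol(A)
    acc = vcat(acc, col)   # allocates length(acc) + length(col) elements each time
end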

It looks like stack might already be the function we need, assuming it can supersede all existing efficient uses of reduce for concatenation. If there are still missing cases, I’d suggest we add them to stack rather than continue to bloat reduce.


I agree with this.

But I don’t understand what you mean here. You cannot remove reduce(hcat, ...); how would that even work? Unless you propose to deprecate either reduce or hcat for v2.0, which, again, seems completely impractical.