`cat` allocates too much memory

Hi. I’m new to Julia language.

I’m trying to convert a big vector of small arrays into a tensor, but cat() is drawing too much resources.

julia> A = [rand(Int, (28,28)) for _ ∈ 1:10000];

julia> @time cat(A..., dims=3);
  4.226158 seconds (50.11 M allocations: 3.044 GiB, 30.98% gc time)

Oddly, making the operation more explicit dramatically improves this:

julia> stacked = Array{Int, 3}(undef, 28, 28, 10000);

julia> @time map!.(x->x, eachslice(stacked, dims=3), A);
  0.133267 seconds (222.95 k allocations: 11.972 MiB)

This reply to another topic mentions that reduce(hcat, A) and reduce(vcat, A) are optimized to perform better than hcat(A...) and vcat(A...). Is there such thing for cat()?

The following naive attempt made things worse:

julia> @time reduce((x...)->cat(x..., dims=3), A);
  4.431142 seconds (616.15 k allocations: 18.539 GiB, 1.53% gc time)
1 Like

it’s not cat that is allocating, it’s the splatting

4 Likes

You’ll probably get the best possible performance by doing this in the simplest way: Construct a new array of the appropriate size and then write a loop that fills in each slice of the stacked array. One-liner solutions are convenient, but a less-clever approach often offers the best performance.

8 Likes

Is there a way to bypass that?

This solution doesn’t use splatting and perform even worse:

julia> @time reduce((x,y)->cat(x, y, dims=3), A);
  6.262270 seconds (526.79 k allocations: 18.537 GiB, 3.85% gc time)

this allocate once for ever element so even more allocation. Do what @rdeits suggested I think.

Yeah, the loop is more than 100X faster than most of the options above and allocates vastly less memory:

julia> function assemble(A)
         stacked = Array{Int, 3}(undef, size(first(A))..., length(A))
         for i in 1:length(A)
           stacked[:, :, i] = A[i]
         end
         stacked
       end
assemble (generic function with 1 method)

julia> using BenchmarkTools

julia> A = [rand(Int, (28,28)) for _ ∈ 1:10000];

julia> @btime assemble($A);
  24.793 ms (2 allocations: 59.81 MiB)
1 Like

here’s a version using an optimized reduction and a reshape

julia> reshape(reduce(hcat, A), size(A[1])..., :) == cat(A..., dims=3)
true

julia> @time reshape(reduce(hcat, A), size(A[1])..., :);
  0.028446 seconds (7 allocations: 59.815 MiB)
3 Likes

So why is cat so much slower than vcat:

julia> @time reduce((x,y)->cat(x, y;dims=3), A);
 12.606505 seconds (526.91 k allocations: 18.537 GiB, 16.69% gc time)

julia> @time reduce(vcat, A);
  0.038854 seconds (2 allocations: 59.815 MiB)

?

1 Like

reduce(vcat, ...) and reduce(hcat, ...) are fast because they have specialized implementations: https://github.com/JuliaLang/julia/blob/e402cf47dd8e3c509969c90c38ad5d57c746eccf/base/abstractarray.jl#L1564-L1568 . In other words, they produce the same result as naively calling vcat on each element, but they do so in a way which is much more efficient than the naive approach (essentially by performing the same allocation and then loop operation that I wrote out by hand).

3 Likes

It seems like a special implementation for cat would be natural. Is that perhaps more difficult to do, since (x, y)->cat(x, y;dims=n) is anonymous?

yeah I think the dims info is gone.

This PR https://github.com/JuliaLang/julia/pull/37196 wants to make cat(dims=3)::Base.Fix1 to enable precisely this, and then make reduce(cat(dims=3), A) efficient. For now @baggepinnen’s answer is probably the way to go.

Unless you want to use a package, in which case you can make a view of them, instead of a new dense array. These should be fast to construct but possibly slower to use, in whatever the next step is.

julia> B = reduce(cat(dims=3), A);

julia> B ≈ JuliennedArrays.Align(A, 1,2)
true

julia> B ≈ LazyStack.stack(A)
true

julia> B ≈ RecursiveArrayTools.VectorOfArray(A)
true
3 Likes