Transforming an Array{Array{T,N}, M} to an Array{T, N+M}

Hi,

Usually I have had a lot of success discovering the various ways to transform the Array datatypes into each other, but this one gives me some headaches: from a parallelized operation I end up with an array (with an arbitrary number of dimensions, each of arbitrary length) of matrices (which could be arbitrarily large, but have the same size for each entry).

I now wish to “flatten”(?) said Array{Matrix{ComplexF64}} into an Array{ComplexF64} with the dimensions of the matrix appended. This will then be used to store the array in an HDF5 file (a sketch of that step follows the toy example below).

Toy example:

dimensions = (2,3,4) # could have an arbitrary number of dimensions
rand_matrix() = rand(ComplexF64, 3,3) # dimension of each matrix is identical
                                      # for each entry

# the result would actually be an Array{T, N} with N dimensions and an
# arbitrary type T, which I mapped to the result like
# result = map(x -> x.data, result)
# for the MWE I use
result = map(x->rand_matrix(), Array{Int}(undef,dimensions))

# Now I wish to transform this 2x3x4 Array{Matrix{ComplexF64}, 3}
# to an Array{ComplexF64, 5}

# The following works but seems to be not very idiomatic ;)
transformed_result = Array{ComplexF64}(undef, size(result)..., size(first(result))...)
for i in Iterators.product(range.(1,size(result),step=1)...)
    transformed_result[i...,:,:] = result[i...]
end
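
For context, writing the flattened array out could then look roughly like this (a sketch using HDF5.jl with a hypothetical file name, assuming a recent HDF5.jl version that handles the ComplexF64 compound type):

using HDF5
h5open("results.h5", "w") do file
    # the 5-dimensional ComplexF64 array can be written as a single dataset
    write(file, "transformed_result", transformed_result)
end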

What I don’t like about the solution:

  • It assumes that there are two dimensions in each “entry”, i.e. a matrix, but what if I eventually started with an Array{Array{T,3},4}?
  • It does not seem very idiomatic. In particular, I feel that the use of Iterators.product hides the actually intended operation.

I feel that there could be an easy one-liner involving hvcat, but I can’t come up with it.

Thanks for any ideas!

Sidenote: I could directly accumulate my results in the final data structure, but in my actual use case I’m storing elements of a certain type T on which I also perform other operations later, e.g. by calling f.(result).

1 Like

The easiest I can think of is to use the EllipsisNotation.jl package; then you could do:

julia> using EllipsisNotation

julia> for i in CartesianIndices(result)
           transformed_result[i,..] .= result[i]
       end

This replaces the loop based on Iterators.product and works no matter the number of dimensions. I think this is the cleaner solution, and it is also very readable. One thing to consider: no matter what method you choose, you will have to copy all the data to the new structure.
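
Putting it together as a generic function (just a sketch; the name combine_ellipsis is made up, and it keeps the outer dimensions first as in your original post):

using EllipsisNotation

function combine_ellipsis(result)
    out = Array{eltype(first(result))}(undef, size(result)..., size(first(result))...)
    for i in CartesianIndices(result)
        out[i, ..] .= result[i]   # .. stands for however many inner dimensions there are
    end
    return out
end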

4 Likes

Here’s one idea, using an example where N=3, M=2:

julia> B = [1 2;;; 3 4;;; 5 6]   # This syntax requires Julia 1.7
1×2×3 Array{Int64, 3}:
[:, :, 1] =
 1  2

[:, :, 2] =
 3  4

[:, :, 3] =
 5  6

julia> A = [B for _ in 1:2, _ in 1:3]
2×3 Matrix{Array{Int64, 3}}:
 [1 2;;; 3 4;;; 5 6]  [1 2;;; 3 4;;; 5 6]  [1 2;;; 3 4;;; 5 6]
 [1 2;;; 3 4;;; 5 6]  [1 2;;; 3 4;;; 5 6]  [1 2;;; 3 4;;; 5 6]

julia> reshape(cat(A..., dims=4), 1, 2, 3, 2, 3)
1×2×3×2×3 Array{Int64, 5}:
[:, :, 1, 1, 1] =
 1  2

[:, :, 2, 1, 1] =
 3  4
...

Does this do what you want?

Edit: generic version of the last line:

reshape(cat(A..., dims=ndims(B)+1), size(B)..., size(A)...)
1 Like

Another way is

transformed_result = reshape(reduce(vcat, (transpose(vec(r)) for r in vec(result))), size(result)..., size(first(result))...)
1 Like

Unless A is quite small, it’s probably better to do reduce((a,b)->cat(a,b; dims=d), A) than cat(A...; dims=d) (for the appropriate dimension d).
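
For instance, with the A and B from the post above (just a sketch, not benchmarked):

C = reduce((a, b) -> cat(a, b; dims=ndims(B) + 1), A)
reshape(C, size(B)..., size(A)...)   # 1×2×3×2×3, same result as the cat(A..., dims=...) version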

I think that will allocate and copy the data again and again for each element in A, so it depends:

reduce is better for large A and small B (elements of A):

julia> B = [1 2;;; 3 4;;; 5 6];

julia> A = [B for _ in 1:2000, _ in 1:3];

julia> @time cat(A..., dims=ndims(B)+1); # 2nd run
  1.114491 seconds (71.02 k allocations: 552.791 MiB, 13.32% gc time)

julia> @time reduce((a,b)->cat(a,b;dims=ndims(B)+1), A); # 2nd run
  0.295002 seconds (249.34 k allocations: 113.688 MiB)

but it’s worse for small A and large B:

julia> B = repeat([1 2;;; 3 4;;; 5 6], outer=(1, 1, 100_000));

julia> A = [B for _ in 1:20, _ in 1:3];

julia> @time cat(A..., dims=ndims(B)+1); # 2nd run
  0.603539 seconds (682 allocations: 274.713 MiB, 0.70% gc time)

julia> @time reduce((a,b)->cat(a,b;dims=ndims(B)+1), A); # 2nd run
 15.546297 seconds (57.78 k allocations: 8.179 GiB, 1.48% gc time)

I think the compiler could be improved to make cat(A..., dims=ndims(B)+1) efficient (in principle it should just need 1 allocation and 1 copy?)

For reduce I expect the inefficiency: there are specialized implementations of reduce for hcat and vcat but it would be difficult to do the same for a lambda…
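
To illustrate what that one-allocation version could look like, here is a minimal Base-only sketch (the name combine_manual is made up; this is not how cat is implemented):

function combine_manual(A)
    B1 = first(A)
    out = similar(B1, size(B1)..., size(A)...)   # single allocation for the final shape
    colons = ntuple(_ -> Colon(), ndims(B1))
    for (k, Bk) in enumerate(A)                  # k is a linear index over the outer dimensions
        out[colons..., k] = Bk                   # each element is copied exactly once
    end
    return out
end

This produces the (size(B)..., size(A)...) shape directly, with the same memory layout as the reshaped cat result.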

2 Likes

That’s exactly what a function in SplitApplyCombine.jl does:

using SplitApplyCombine
combinedims(result)
# returns 3×3×2×3×4 Array{ComplexF64, 5}

Don’t want to allocate another array and prefer a view instead? It has combinedimsview() for this purpose.
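
A quick sketch for the toy example (assuming combinedimsview uses the same dimension order as combinedims):

v = combinedimsview(result)   # lazy view, no copy
size(v)                       # (3, 3, 2, 3, 4): matrix dims first, then the outer dims
v == combinedims(result)      # true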

5 Likes

Thanks everyone for the plenty of input!

I’ll be marking @aplavin’s response as the solution, mainly for these reasons:

  • combinedimsview() is really what I’ve been looking for without knowing it, as I don’t need a full copy
  • I think the SplitApplyCombine package is probably what someone else with a similar problem would like to find
  • I also have to confess that I had larger arrays in mind than I posted in my MWE (say 200x40x200 for the outer one). This renders the cat solutions more or less useless, as I show below, even though I like them for their brevity and for being in Base

The only thing I really dislike about this solution is that it is not in Base and therefore pulls in another dependency :wink:

For the sake of completeness I wrote a small benchmark script that tests the various approaches (note that I switched everything to column-major order, i.e. the matrix dimensions now come first).

Summary
#%%
using Pkg
Pkg.activate("--temp")
Pkg.add("BenchmarkTools")
Pkg.add("SplitApplyCombine")
Pkg.add("EllipsisNotation")

#%%
using BenchmarkTools
using SplitApplyCombine
using EllipsisNotation

#%%
function setup_data(outer_dims = (2,3,4), inner_dims = (3,3))
    rand_matrix() = rand(ComplexF64, inner_dims...)
    map(x->rand_matrix(), Array{Int}(undef, outer_dims))
end

#%% Proposed ways
function combine_loop(A)
    transformed_A = Array{ComplexF64}(undef, size(first(A))..., size(A)...)
    for i in Iterators.product(range.(1,size(A),step=1)...)
        transformed_A[:,:,i...] = A[i...]
    end
    transformed_A
end

function combine_loop_CartesianIndices(A)
    transformed_A = Array{ComplexF64}(undef, size(first(A))..., size(A)...)
    for i in CartesianIndices(A)
        transformed_A[:,:,i] = A[i]
    end
    transformed_A
end

function combine_loop_EllipsisNotation(A)
    transformed_A = Array{ComplexF64}(undef, size(first(A))...,size(A)...)
    for i in CartesianIndices(A)
        transformed_A[..,Tuple(i)...] = A[i]
    end
    transformed_A
end

function combine_cat(A)
    reshape(cat(A..., dims=ndims(first(A))+1), size(first(A))..., size(A)...)
end

function combine_cat_reduce(A)
    reshape(reduce((a,b)->cat(a,b;dims=ndims(first(A))+1), A), size(first(A))..., size(A)...)
end

function combine_SplitApplyCombine(A)
    combinedims(A)
end

function combine_SplitApplyCombine_view(A)
    combinedimsview(A)
end

#%% Verify that they are all identical
test_data = setup_data((2,3,4), (3,3))

test_functions = [combine_loop,
                  combine_loop_CartesianIndices,
                  combine_loop_EllipsisNotation,
                  combine_cat,
                  combine_cat_reduce,
                  combine_SplitApplyCombine,
                  combine_SplitApplyCombine_view]

individual_results = collect(f(test_data) for f in test_functions)


# check that all individual results are equal
@assert all(x -> x == individual_results[1], individual_results)

#%% Benchmark them
benchmark_small_only = [combine_cat] # <-- this needs too much memory on my system

small_dims = ((2,3,4), (3,3))
large_dims = ((200,40,200), (3,3))

benchmarks_small = []
benchmarks_large = []

for f in test_functions
    @show f
    push!(benchmarks_small, @benchmark $f(data) setup=(data=setup_data(small_dims...)))
    if !(f in benchmark_small_only)
        push!(benchmarks_large, @benchmark $f(data) setup=(data=setup_data(large_dims...)))
    else
        push!(benchmarks_large, missing)
    end
end

#%% Print some info

function stringmean(trial)
    t = mean(trial).time
    allocs = mean(trial).allocs
    memory = mean(trial).memory
    "$(t) ns, $allocs allocs, $memory bytes"
end

println("Small Dimensions ($small_dims)")
for (i, f) in enumerate(test_functions)
    println("\n$f:")
    display(mean(benchmarks_small[i]))
end

println("\n\nLarge Dimensions ($large_dims)")
for (i, f) in enumerate(test_functions)
    println("\n$f:")
    if !ismissing(benchmarks_large[i])
        display(mean(benchmarks_large[i]))
    else
        println("Too much memory / too slow.")
    end
end

which gives the following output on my moderately powered laptop

Summary
Small Dimensions (((2, 3, 4), (3, 3)))

combine_loop:
BenchmarkTools.TrialEstimate:
  time:             5.074 μs
  gctime:           321.659 ns (6.34%)
  memory:           4.67 KiB
  allocs:           26

combine_loop_CartesianIndices:
BenchmarkTools.TrialEstimate:
  time:             5.779 μs
  gctime:           377.396 ns (6.53%)
  memory:           4.67 KiB
  allocs:           26

combine_loop_EllipsisNotation:
BenchmarkTools.TrialEstimate:
  time:             5.685 μs
  gctime:           341.360 ns (6.00%)
  memory:           4.67 KiB
  allocs:           26

combine_cat:
BenchmarkTools.TrialEstimate:
  time:             37.058 μs
  gctime:           398.570 ns (1.08%)
  memory:           16.19 KiB
  allocs:           236

combine_cat_reduce:
BenchmarkTools.TrialEstimate:
  time:             66.158 μs
  gctime:           4.579 μs (6.92%)
  memory:           57.84 KiB
  allocs:           448

combine_SplitApplyCombine:
BenchmarkTools.TrialEstimate:
  time:             2.176 μs
  gctime:           189.162 ns (8.69%)
  memory:           3.55 KiB
  allocs:           2

combine_SplitApplyCombine_view:
BenchmarkTools.TrialEstimate:
  time:             1.718 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0


Large Dimensions (((200, 40, 200), (3, 3)))

combine_loop:
BenchmarkTools.TrialEstimate:
  time:             559.914 ms
  gctime:           185.278 ms (33.09%)
  memory:           292.97 MiB
  allocs:           1600003

combine_loop_CartesianIndices:
BenchmarkTools.TrialEstimate:
  time:             587.806 ms
  gctime:           201.019 ms (34.20%)
  memory:           292.97 MiB
  allocs:           1600003

combine_loop_EllipsisNotation:
BenchmarkTools.TrialEstimate:
  time:             547.953 ms
  gctime:           185.343 ms (33.82%)
  memory:           292.97 MiB
  allocs:           1600003

combine_cat:
Too much memory / too slow.

combine_cat_reduce:
BenchmarkTools.TrialEstimate: 
  time:             59.245 s
  gctime:           11.443 s (19.32%)
  memory:           87.26 GiB
  allocs:           31786991

combine_SplitApplyCombine:
BenchmarkTools.TrialEstimate:
  time:             550.919 ms
  gctime:           36.190 ms (6.57%)
  memory:           219.73 MiB
  allocs:           3

combine_SplitApplyCombine_view:
BenchmarkTools.TrialEstimate:
  time:             1.869 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

My interpretation is:

  • Use views via combinedimsview() wherever possible, as it is simply the fastest solution.
  • Avoid cat and reduce(cat) in this case, except for the smallest of arrays.
  • If you really need a copy and don’t want to pull in a package, there is nothing wrong with the looping version.
5 Likes

SplitApplyCombine is a pretty stable, reliable, and performant package; it shouldn’t be an issue to depend on it.

I would even suggest that everyone working with arrays/tables/data in general become familiar with it. There are very useful functions complementing Base in these areas: combinedims/splitdims, different groupings and joins, etc. These functions are lightweight and general, and they work with different kinds of arrays and with many Table implementations. [I’m not involved in SplitApplyCombine development at all, just a happy user (: ]

4 Likes

Agreed, SplitApplyCombine is the best solution.

Still, concatenating arrays is basic functionality; it feels like something that should be covered in Base. It’s a bit of a shame that cat doesn’t work well with a large number of arrays.
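
(Side note for readers on Julia ≥ 1.9: Base.stack now covers exactly this pattern, placing the inner dimensions first like combinedims.)

S = stack(result)            # 3×3×2×3×4 Array{ComplexF64, 5}, no package needed
S == combinedims(result)     # true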

1 Like

This is my first time trying the TensorCast.jl package from @mcabbott, and I wonder if it could be an appropriate alternative:

using TensorCast

# returns a view, 2×3×4×3×3 Array{ComplexF64, 5}
@cast transf_result[i,j,k,l,m] := result[i,j,k][l,m];

# returns a copy, 2×3×4×3×3 Array{ComplexF64, 5}
@cast transf_result[i,j,k,l,m] |= result[i,j,k][l,m];
2 Likes