What is the best way to get a running matrix product over a large 3D array?

Heya,

I am doing some Markov process simulations which output N×N matrices. These are currently stored in a 3D array X[N, N, M], where M can be very large (1e6 or more).

My code currently uses a reduce call, reduce(*, eachslice(X; dims=3)), to get the overall matrix product – a single N×N matrix.

In the long term N could be quite large, and in the longer term these will be rectangular matrices of varying sizes: (M × N) * (N × P) * (P × Q) * … * (U × V) → (M × V).

Question 1: Would it be better to take a running product? I tried doing this straightforwardly with a loop, but the time is about the same and the allocations are double those of just storing the whole of X in memory and then reducing at the end. But perhaps I am doing it poorly?

using BenchmarkTools
using LinearAlgebra

function store_mem(N, G)
    X = zeros(N, N, G)
    for i in 1:G 
        C = rand(N, N)
        X[:, :, i] = C ./ sum(C, dims=1) # stochastic matrix
    end
    return reduce(*, eachslice(X; dims=3))
end

function store_inplace(N, G)
    X = Matrix{Float64}(I, N, N) # Float64 identity; Matrix(I, N, N) would give a Matrix{Bool}
    for i in 1:G 
        C = rand(N, N)
        X *= C ./ sum(C, dims=1) # stochastic matrix
    end
    return X
end

mem_bench = @benchmark store_mem(10, 1000000)
inplace_bench = @benchmark store_inplace(10, 1000000)

Question 2: A friend mentioned this was reminiscent of tensor contractions, so I tried the Tullio.jl package, but I am finding the macro notation quite confusing. In my attempt (MWE below) I seem to get the elementwise product of the entries across slices, not the product of the matrices. I have looked over the list of example calls in the repo but I can't seem to find an equivalent contraction. I assume that if I go this route the advantage would disappear once the matrices become variable in size, or is there a way to cope with this?

using Tullio 

c = rand(3, 3, 3)

truth = c[:, :, 1] * c[:, :, 2] * c[:, :, 3];
display(truth)

direct = reduce(*, eachslice(c; dims=3));
display(direct)

@tullio (*) einstein[i, j] := c[i, j, k];
display(einstein)
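
For reference, a single pairwise step does fit the macro notation, since the product of two slices is the standard contraction over one shared index; the full chain over all slices would need a separate summed index per slice boundary, which seems to be why I can't find a single @tullio call for it:

A1 = c[:, :, 1];
A2 = c[:, :, 2];
@tullio two[i, j] := A1[i, k] * A2[k, j]; # k is repeated on the right, so it is summed over
display(two ≈ c[:, :, 1] * c[:, :, 2])    # true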

I get about double the performance of your store_inplace by actually storing in-place rather than doing everything out-of-place:

using LinearAlgebra, Random

function store_inplace2(N, G)
    X     = Matrix{Float64}(I, N, N)
    X_old = similar(X)
    C     = similar(X)
    @inbounds for _ ∈ 1:G
        rand!(C)                # fill C in-place, no allocation
        for j ∈ axes(C, 2)      # normalize each column in-place
            s = sum(view(C, :, j))
            @simd for i ∈ axes(C, 1)
                C[i, j] = C[i, j] / s
            end
        end
        X_old .= X              # mul! must not alias its arguments, so copy first
        mul!(X, X_old, C)       # X = X_old * C, computed in-place
    end
    X
end

julia> @benchmark store_inplace(10, 1000000)
BenchmarkTools.Trial: 7 samples with 1 evaluation per sample.
 Range (min … max):  675.556 ms …    1.049 s  ┊ GC (min … max): 15.43% … 17.49%
 Time  (median):     688.002 ms               ┊ GC (median):    15.42%
 Time  (mean ± σ):   738.965 ms ± 136.959 ms  ┊ GC (mean ± σ):  15.95% ±  0.79%

  ▁█▁█                                                        ▁  
  ████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  676 ms           Histogram: frequency by time          1.05 s <

 Memory estimate: 2.79 GiB, allocs estimate: 8000002.


julia> @benchmark store_inplace2(10, 1000000)
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min … max):  387.539 ms … 414.597 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     394.907 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   395.256 ms ±   7.108 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ █ █ █ █   █   ███ █    █    █                             █  
  █▁█▁█▁█▁█▁▁▁█▁▁▁███▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  388 ms           Histogram: frequency by time          415 ms <

 Memory estimate: 2.77 KiB, allocs estimate: 6.

prod

You said the dimensions of the matrices will vary. There is a prod function for arrays of matrices. Please check the following:

julia> Y = [rand(3, 2), rand(2, 5), rand(5, 4)]; 

julia> prod(Y)
3×4 Matrix{Float64}:
 0.681755  1.75165  1.27657  1.08206
 0.780068  1.5965   1.24478  1.13215
 0.895299  1.83247  1.42873  1.29943

Even if you find a better way at this stage, you will eventually have to give up the tensor style if you want to generalize the matrix sizes.

Time

In terms of performance, I guess there is no significant difference between them (of course, I know you need to handle large M), because as far as I know the implementation of prod also uses reduce and *. At least prod will be more convenient.
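
For example, with the Y above the two spellings agree (up to floating-point rounding):

julia> prod(Y) ≈ reduce(*, Y)
true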

# M = 100
julia> @btime reduce(*, eachslice($X; dims=3));  
  16.200 μs (199 allocations: 91.33 KiB)
julia> @btime prod($Y);
  15.900 μs (198 allocations: 91.27 KiB)

# M = 1000
julia> @btime reduce(*, eachslice($X; dims=3));  
  163.900 μs (1999 allocations: 921.02 KiB)
julia> @btime prod($Y);
  163.600 μs (1998 allocations: 920.95 KiB)

Space

One more thing. Vector{Matrix} is 100× more efficient to store.

julia> N, M = 10, 1000000
(10, 1000000)

julia> X = randn(N, N, M);

julia> Y = [randn(N, N) for _ in 1:M];

julia> sizeof(X)
800000000

julia> sizeof(Y)
8000000

Whole test code:

Y = [rand(3, 2), rand(2, 5), rand(5, 4)];
prod(Y)

using BenchmarkTools

N = 10
M = 1000 # or 100

X = randn(N, N, M)
Y = [randn(N, N) for _ in 1:M]

@btime reduce(*, eachslice($X; dims=3));
@btime prod($Y);

M = 1000000
X = randn(N, N, M);
Y = [randn(N, N) for _ in 1:M];
sizeof(X)
sizeof(Y)

Thank you, the actually-in-place code example was useful. I realized from my benchmark that my version wasn't in-place, but I couldn't find any info on why, or on what to do better, so your changes are massively appreciated.

Thanks for the suggestion, I will switch to Vector{Matrix}; I hadn't realized the memory allocation was so different.

So from what you said it seems prod() is syntactically nicer but still the same essential reduction?

From this and @Mason I gather I should just do things in-place properly and that is the best I can hope for?

No it’s not. When you do sizeof(Y) it’s only counting the size of the array of pointers, not the size of the pointed-to memory. You can use Base.summarysize to get the total reachable size:

julia> N, M = 10, 1000;

julia> X = randn(N, N, M);

julia> Y = [randn(N, N) for _ in 1:M];

julia> sizeof(X), sizeof(Y)
(800000, 8000)

julia> Base.summarysize(X), Base.summarysize(Y)
(800056, 856040)

If all the arrays are the same size then a 3d array will take slightly less memory than an array of matrices (because the 3d array doesn’t have a separate object for each matrix). An array of matrices is more flexible in other ways, however — it is easier to change dynamically, e.g. to resize or to re-order, and it can also hold matrices of heterogeneous shapes.
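
If you want the compact storage of the 3d array together with the per-matrix convenience, one middle ground (just a sketch) is a vector of views into the array: no copies are made, and prod still works because * is defined for any AbstractMatrix. (I use a small third dimension here so the product of random matrices stays finite.)

julia> Xs = randn(10, 10, 50);

julia> Xv = [view(Xs, :, :, k) for k in axes(Xs, 3)]; # SubArrays sharing Xs's memory

julia> prod(Xv) ≈ reduce(*, eachslice(Xs; dims=3))
true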

It will be more efficient to allocate the result matrix once and accumulate the running product in-place with LinearAlgebra.mul!, e.g.

using LinearAlgebra
function myprod(X::AbstractArray{T, 3}) where {T<:Number}
    P = Matrix{T}(I, size(X,1), size(X,2))
    LinearAlgebra.checksquare(P) # the slices must be square
    A = similar(P) # temporary storage
    for i in axes(X, 3)
        @views mul!(A, P, X[:,:,i])
        P, A = A, P # swap buffers instead of copying
    end
    return P
end

This allocates much less memory (though the time is only slightly improved if N is sufficiently large, since in that case the time is dominated by the O(MN^3) matrix multiplication cost).
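
As a quick sanity check against the reduce version (arbitrary small sizes):

julia> X = rand(4, 4, 100);

julia> myprod(X) ≈ reduce(*, eachslice(X; dims=3))
true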

For example, here is an in-place version of your running product, with no allocations inside the loop:

julia> function store_inplace(N, G)
           X = Matrix{Float64}(I, N, N)
           A = similar(X)
           C = similar(X)
           for i in 1:G 
               C .= rand.()
               for col in eachcol(C) # note: eachslice(C, dims=1) would iterate rows
                   col ./= sum(col)
               end
               mul!(A, X, C)
               X, A = A, X
           end
           return X
       end
store_inplace (generic function with 1 method)

julia> @btime store_inplace(10, 1000000);
  543.341 ms (6 allocations: 2.77 KiB)

julia> @btime store_mem(10, 1000000);
  1.519 s (8000002 allocations: 3.53 GiB)

and it is about 3x faster than store_mem.

Whoops, I see that @Mason already commented on this. Note that the X, A = A, X saves a copy compared to X .= A, though the timing difference is negligible since the mul! cost dominates. (This is what I get for not reading the whole thread.)


If the matrices have different dimensions, then there is more to be gained by exploiting the associativity of matrix multiplication (that is, by placing parentheses cleverly). If the dimensions are really varied, this is a big gain in both performance and storage.

A quick example:

julia> Y = [rand(2, 2), rand(2,100), rand(100, 2)];

julia> @btime prod($Y);
  467.474 ns (4 allocations: 1.75 KiB)

julia> @btime (Y[1]*(Y[2]*Y[3]));
  289.145 ns (4 allocations: 224 bytes)

So are there any algorithms which will order those for me?

If I have a long list of rectangular matrices and then call prod() on it, for example.

I assume the point is largely doing the largest multiplications first, to try to minimise the impact on the whole chain?

Thanks for the reply. The copy save is indeed helpful!

Basically you want to multiply small matrices by large matrices, not large matrices by large matrices.
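
To answer the algorithm question: choosing the best parenthesization is the classic matrix chain multiplication problem, and the textbook solution is an O(n^3) dynamic program over the list of dimensions (prod does not do any such reordering for you). Here is a minimal sketch; the function names chain_order and chain_prod are my own, not from any package:

function chain_order(dims::AbstractVector{<:Integer})
    # matrix i has size dims[i] × dims[i+1], so n matrices need length(dims) == n + 1
    n = length(dims) - 1
    cost  = zeros(Int, n, n) # cost[i, j]: minimal scalar multiplications for chain i..j
    split = zeros(Int, n, n) # split[i, j]: best split point k for chain i..j
    for len in 2:n, i in 1:(n - len + 1)
        j = i + len - 1
        cost[i, j] = typemax(Int)
        for k in i:(j - 1)
            c = cost[i, k] + cost[k + 1, j] + dims[i] * dims[k + 1] * dims[j + 1]
            if c < cost[i, j]
                cost[i, j] = c
                split[i, j] = k
            end
        end
    end
    return cost, split
end

# multiply the chain recursively, following the optimal split table
function chain_prod(Y, split, i = 1, j = length(Y))
    i == j && return Y[i]
    k = split[i, j]
    return chain_prod(Y, split, i, k) * chain_prod(Y, split, k + 1, j)
end

For the example above it picks Y[1]*(Y[2]*Y[3]), the same parenthesization that was placed by hand:

julia> Y = [rand(2, 2), rand(2, 100), rand(100, 2)];

julia> dims = [size(Y[1], 1); [size(A, 2) for A in Y]]; # [2, 2, 100, 2]

julia> _, split = chain_order(dims);

julia> chain_prod(Y, split) ≈ prod(Y)
true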