Better performance for blockwise matrix multiplication

memory-allocation

#1

The following function performs a matrix multiplication blockwise when the result is a very large matrix. The arrays are sliced into smaller blocks, the blocks are multiplied, and the final matrix is assembled from the partial results.

"""
    block_multiply(X, R, A; n)

Perform matrix multiplication block-wise: B = X*R*X'*A
"""
function block_multiply(X, R, A; n=3)
    N, M = size(X)
    T = floor.(Int, range(0, N, length=n+1))
    spans = collect(T[i]+1:T[i+1] for i=1:n)
    B = similar(A)
    fill!(B, 0.0)
    for i=1:n
        for j=1:n
            a = spans[i]
            b = spans[j]
            B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:]
        end
    end
    return B
end

The problem is that this calculation still requires a huge number of allocations. For instance, the following performance test

 function perf_test(ndofs, nvals, nkeep, nblocks)
     X = rand(ndofs, nvals)
     R = rand(nvals, nvals)
     A = rand(ndofs, nkeep)
     @time B = block_multiply(X, R, A; n=nblocks)
 end

with the input perf_test(30000, 1, 500, 10)
performs as follows: 28.734395 seconds (2.81 k allocations: 11.297 GiB, 34.04% gc time).

Any suggestions on how I could make this function perform better (use fewer allocations)?


#2

You could refactor the line `B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:]` to do the multiplications explicitly, since the chained product creates intermediate allocations in every loop iteration:
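The original post doesn't show the refactored code, so here is one sketch of the idea (my interpretation; `block_multiply2` is a name I made up): compute `transpose(X[b,:]) * A[b,:]` first so every intermediate stays small (M×nkeep instead of block×block).

```julia
"""
    block_multiply2(X, R, A; n)

Like block_multiply, but with the chained product split into explicit
steps so the small (X[b,:]' * A[b,:]) product is formed first.
"""
function block_multiply2(X, R, A; n=3)
    N, M = size(X)
    T = floor.(Int, range(0, N, length=n+1))
    spans = [T[i]+1:T[i+1] for i in 1:n]
    B = zeros(eltype(A), size(A))
    for i in 1:n, j in 1:n
        a, b = spans[i], spans[j]
        # Small intermediate: M×nkeep instead of a block×block matrix.
        tmp = transpose(X[b, :]) * A[b, :]
        B[a, :] += X[a, :] * (R * tmp)
    end
    return B
end
```

This still allocates the slices and the small temporaries, but it avoids forming any block×block product.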


#3

If the sizes of the matrices here are representative, remember that the order in which you perform the operations can matter a lot. With nvals = 1, evaluating X * R * X' left to right materializes a 30000×30000 dense intermediate (~6.7 GiB of Float64s), whereas X' * A is only 1×500, so computing (X' * A) first avoids the huge intermediate entirely. Here that alone should be enough, but also check out https://github.com/AustinPrivett/MatrixChainMultiply.jl, which picks a cheap parenthesization automatically.

julia> using MatrixChainMultiply

julia> function perf_test(ndofs, nvals, nkeep, nblocks)
           X = rand(ndofs, nvals)
           R = rand(nvals, nvals)
           A = rand(ndofs, nkeep)
           @time B = block_multiply(X, R, A; n=nblocks)
           GC.gc()
           @time B = X * R * X' * A
           GC.gc()
           @time B = X * R * (X' * A)
           GC.gc()
           @time B = matrixchainmultiply(X, R, X', A)
       end
perf_test (generic function with 1 method)

julia> perf_test(30000, 1, 500, 10);
 13.832762 seconds (1.28 k allocations: 9.055 GiB, 9.13% gc time)
 16.128467 seconds (6 allocations: 6.818 GiB, 0.67% gc time)
  0.052784 seconds (5 allocations: 114.674 MiB, 17.54% gc time)
  0.055796 seconds (40 allocations: 114.451 MiB, 0.43% gc time)

(I ran this on the nightly build)

If you still need to do blocking, use views instead of slicing, since views do not copy, and allocate work buffers up front so the per-block products are stored into them rather than freshly allocated.