Better performance for blockwise matrix multiplication

memory-allocation

#1

The following function performs a matrix multiplication blockwise when the result is a very large matrix. The arrays are sliced into smaller blocks, the blocks are multiplied, and the final matrix is assembled from the partial results.

"""
    block_multiply(X, R, A; n)

Perform matrix multiplication block-wise: B = X*R*X'*A
"""
function block_multiply(X, R, A; n=3)
    N, M = size(X)
    T = floor.(Int, range(0, N, length=n+1))
    spans = collect(T[i]+1:T[i+1] for i=1:n)
    B = similar(A)
    fill!(B, 0.0)
    for i=1:n
        for j=1:n
            a = spans[i]
            b = spans[j]
            B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:]
        end
    end
    return B
end

The problem is that this calculation still requires a huge number of allocations. For instance, the following performance test

 function perf_test(ndofs, nvals, nkeep, nblocks)
     X = rand(ndofs, nvals)
     R = rand(nvals, nvals)
     A = rand(ndofs, nkeep)
     @time B = block_multiply(X, R, A; n=nblocks)
 end

with the input perf_test(30000, 1, 500, 10)
performs as follows: 28.734395 seconds (2.81 k allocations: 11.297 GiB, 34.04% gc time).

Any suggestions on how I could make this function perform better (use fewer allocations)?


#2

You could refactor the line `B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:]` to do the multiplications explicitly, since the chained product creates intermediate allocations in every loop iteration:
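The original post doesn't show the refactored code, so here is one sketch of the idea (my interpretation; `block_multiply2` is a name I made up): compute `transpose(X[b,:]) * A[b,:]` first so every intermediate stays small (M×nkeep instead of block×block).

```julia
"""
    block_multiply2(X, R, A; n)

Like block_multiply, but with the chained product split into explicit
steps so the small (X[b,:]' * A[b,:]) product is formed first.
"""
function block_multiply2(X, R, A; n=3)
    N, M = size(X)
    T = floor.(Int, range(0, N, length=n+1))
    spans = [T[i]+1:T[i+1] for i in 1:n]
    B = zeros(eltype(A), size(A))
    for i in 1:n, j in 1:n
        a, b = spans[i], spans[j]
        # Small intermediate: M×nkeep instead of a block×block matrix.
        tmp = transpose(X[b, :]) * A[b, :]
        B[a, :] += X[a, :] * (R * tmp)
    end
    return B
end
```

This still allocates the slices and the small temporaries, but it avoids forming any block×block product.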


#3

If the sizes of the matrices here are representative, remember that the order in which you perform the operations can matter a lot. With nvals = 1, evaluating X * R * X' left to right materializes a 30000×30000 dense intermediate (~6.7 GiB of Float64s), whereas X' * A is only 1×500, so computing (X' * A) first avoids the huge intermediate entirely. Here that alone should be enough, but also check out https://github.com/AustinPrivett/MatrixChainMultiply.jl, which picks a cheap parenthesization automatically.

julia> using MatrixChainMultiply

julia> function perf_test(ndofs, nvals, nkeep, nblocks)
           X = rand(ndofs, nvals)
           R = rand(nvals, nvals)
           A = rand(ndofs, nkeep)
           @time B = block_multiply(X, R, A; n=nblocks)
           GC.gc()
           @time B = X * R * X' * A
           GC.gc()
           @time B = X * R * (X' * A)
           GC.gc()
           @time B = matrixchainmultiply(X, R, X', A)
       end
perf_test (generic function with 1 method)

julia> perf_test(30000, 1, 500, 10);
 13.832762 seconds (1.28 k allocations: 9.055 GiB, 9.13% gc time)
 16.128467 seconds (6 allocations: 6.818 GiB, 0.67% gc time)
  0.052784 seconds (5 allocations: 114.674 MiB, 17.54% gc time)
  0.055796 seconds (40 allocations: 114.451 MiB, 0.43% gc time)

(I ran this on the nightly build)

If you still need to do blocking, use views instead of slicing, since views do not copy, and allocate work buffers up front so the per-block products are stored into them rather than freshly allocated.