# Better performance to blockwise matrix multiplication

#1

The following function makes it possible to perform blockwise matrix multiplication that results in a very large matrix. Basically the arrays are sliced into smaller blocks which are then multiplied and the final matrix is built from them.

``````"""
block_multiply(X, R, A; n)

Perform matrix multiplication block-wise: B = X*R*X'*A
"""
function block_multiply(X, R, A; n=3)
N, M = size(X)
T = floor.(Int, linspace(0, N, n+1))
spans = collect(T[i]+1:T[i+1] for i=1:n)
B = similar(A)
fill!(B, 0.0)
local i, j
for i=1:n
for j=1:n
a = spans[i]
b = spans[j]
B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:]
end
end
return B
end
``````

Now the problem is, that this calculation still needs a huge amount of allocations. For instance the following performance test

`````` function perf_test(ndofs, nvals, nkeep, nblocks)
X = rand(ndofs, nvals)
R = rand(nvals, nvals)
A = rand(ndofs, nkeep)
@time B = block_multiply(X, R, A; n=nblocks)
end
``````

with following input: `perf_test(30000, 1, 500; n=10)`
performs as follows: `28.734395 seconds (2.81 k allocations: 11.297 GiB, 34.04% gc time)` .

Any suggestions how I could make this function to perform better (use less allocations)?

#2

You could refactor this line to do it explicitly, since this creates intermediate allocations in each loop:

#3

If the sizes of the matrices are representative here, remember that the order that you do the operations might be important. Here it should be enough to do the `(X' * A)` first but check out https://github.com/AustinPrivett/MatrixChainMultiply.jl.

``````julia> using MatrixChainMultiply

julia> function perf_test(ndofs, nvals, nkeep, nblocks)
X = rand(ndofs, nvals)
R = rand(nvals, nvals)
A = rand(ndofs, nkeep)
@time B = block_multiply(X, R, A; n=nblocks)
gc()
@time B = X * R * X' * A
gc()
@time B = X * R * (X' * A)
gc()
@time B = matrixchainmultiply(X, R, X', A)
end
perf_test (generic function with 1 method)

julia> perf_test(30000, 1, 500, 10);
13.832762 seconds (1.28 k allocations: 9.055 GiB, 9.13% gc time)
16.128467 seconds (6 allocations: 6.818 GiB, 0.67% gc time)
0.052784 seconds (5 allocations: 114.674 MiB, 17.54% gc time)
0.055796 seconds (40 allocations: 114.451 MiB, 0.43% gc time)
``````

(I ran this on the nightly build)

If you still need to do blocking, you should use `view`s instead of slicing, since views do not copy, and also, up-front, allocate work buffers that the resulting blockmatrix multiplications are stored into.