The following function makes it possible to perform blockwise matrix multiplication that results in a very large matrix. Basically the arrays are sliced into smaller blocks which are then multiplied and the final matrix is built from them.
""" block_multiply(X, R, A; n) Perform matrix multiplication block-wise: B = X*R*X'*A """ function block_multiply(X, R, A; n=3) N, M = size(X) T = floor.(Int, linspace(0, N, n+1)) spans = collect(T[i]+1:T[i+1] for i=1:n) B = similar(A) fill!(B, 0.0) local i, j for i=1:n for j=1:n a = spans[i] b = spans[j] B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:] end end return B end
Now the problem is, that this calculation still needs a huge amount of allocations. For instance the following performance test
function perf_test(ndofs, nvals, nkeep, nblocks) X = rand(ndofs, nvals) R = rand(nvals, nvals) A = rand(ndofs, nkeep) @time B = block_multiply(X, R, A; n=nblocks) end
with following input:
perf_test(30000, 1, 500; n=10)
performs as follows:
28.734395 seconds (2.81 k allocations: 11.297 GiB, 34.04% gc time) .
Any suggestions how I could make this function to perform better (use less allocations)?