The following function makes it possible to perform blockwise matrix multiplication that results in a very large matrix. Basically the arrays are sliced into smaller blocks which are then multiplied and the final matrix is built from them.

```
"""
block_multiply(X, R, A; n)
Perform matrix multiplication block-wise: B = X*R*X'*A
"""
function block_multiply(X, R, A; n=3)
N, M = size(X)
T = floor.(Int, linspace(0, N, n+1))
spans = collect(T[i]+1:T[i+1] for i=1:n)
B = similar(A)
fill!(B, 0.0)
local i, j
for i=1:n
for j=1:n
a = spans[i]
b = spans[j]
B[a,:] += X[a,:]*R*transpose(X[b,:])*A[b,:]
end
end
return B
end
```

Now the problem is, that this calculation still needs a huge amount of allocations. For instance the following performance test

```
function perf_test(ndofs, nvals, nkeep, nblocks)
X = rand(ndofs, nvals)
R = rand(nvals, nvals)
A = rand(ndofs, nkeep)
@time B = block_multiply(X, R, A; n=nblocks)
end
```

with following input: `perf_test(30000, 1, 500; n=10)`

performs as follows: `28.734395 seconds (2.81 k allocations: 11.297 GiB, 34.04% gc time)`

.

Any suggestions how I could make this function to perform better (use less allocations)?