Hello,

After playing a little bit with the most naive matmul blocked version I have found on the internet I was wondering what could I do to improve the performance.

```
function matmul_blocked_1!(A, B, C)
bs = 10
n = size(A,1)
@inbounds for kk in 1:bs:n # iterates over cols of A (rows of B)
for jj in 1:bs:n # iterates over rows of B
for i in 1:n # pick slice A[i,kk:kk+bs]
for j in jj:jj+bs-1 # Make dot product A[i,kk:kk+bs] * B_block[:,j] for all j
s = C[i,j]
for k in kk:kk+bs-1
s += A[i,k] * B[k,j]
end
C[i,j] =s
end
end
end
end
# nothing returned, C updated
end
```

I was suspecting that the function above could have read-write performance problems so Iβve time it:

```
using TimerOutputs
function matmul_blocked_1!(A, B, C)
bs = 10
n = size(A,1)
@inbounds for kk in 1:bs:n # iterates over cols of A (rows of B)
for jj in 1:bs:n # iterates over rows of B
@timeit "for i" for i in 1:n # pick slice A[i,kk:kk+bs]
@timeit "for j" for j in jj:jj+bs-1 # Make dot product A[i,kk:kk+bs] * B_block[:,j] for all j
@timeit "get C[i,j]" s = C[i,j]
@timeit "scalar product" for k in kk:kk+bs-1
s += A[i,k] * B[k,j]
end
@timeit "store s to C[i,j]" C[i,j] = s
end
end
end
end
# nothing returned, C updated
end
reset_timer!()
N = 1000
A = rand(N,N)
B = rand(N,N)
C2 = zeros(size(A));
matmul_blocked_1!(A,B,C2)
print_timer()
```

This is what I get:

```
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Time Allocations
ββββββββββββββββββββββ βββββββββββββββββββββββ
Tot / % measured: 58.8s / 100% 4.63GiB / 100%
Section ncalls time %tot avg alloc %tot avg
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
for i 10.0k 58.8s 100% 5.88ms 4.62GiB 100% 484KiB
for j 10.0M 57.7s 98.1% 5.77ΞΌs 4.47GiB 96.8% 480B
scalar product 100M 8.47s 14.4% 84.7ns 0.00B 0.00% 0.00B
get C[i,j] 100M 7.43s 12.6% 74.3ns 0.00B 0.00% 0.00B
store s to C[i,j] 100M 6.78s 11.5% 67.8ns 0.00B 0.00% 0.00B
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
```

Surprisingly (for me) reading `s = C[i,j]`

and writting `C[i,j] = s`

use more time than the actual computation part of the scalar product.

In this sorts of situations, besides βrethinking the algorithmβ, what kind of transformations do you try? (such as reordering loops).

I should add that Iβm not trying to have the best matmul implementation possible (even though having several improvements over this baseline would help me a lot trying to understand how I can get more performance). I am mostly interested in the βthough processβ people have. I am a bit lost sometimes when I face this type of issues . This is a good example that ilustrates working with slices of Arrays.