# Get intuition on how to improve matmul code

Hello,

After playing around a little with the most naive blocked matmul version I could find on the internet, I was wondering what I could do to improve its performance.

``````julia
function matmul_blocked_1!(A, B, C)
    bs = 10
    n = size(A, 1)

    @inbounds for kk in 1:bs:n            # iterate over cols of A (rows of B)
        for jj in 1:bs:n                  # iterate over cols of B (and C)
            for i in 1:n                  # pick slice A[i, kk:kk+bs-1]
                for j in jj:jj+bs-1       # dot product A[i, kk:kk+bs-1] * B_block[:, j] for all j
                    s = C[i, j]
                    for k in kk:kk+bs-1
                        s += A[i, k] * B[k, j]
                    end
                    C[i, j] = s
                end
            end
        end
    end
    # nothing returned, C updated in place
end
``````

I suspected that the function above could have read/write performance problems, so I've timed it:

``````julia
using TimerOutputs

function matmul_blocked_1!(A, B, C)
    bs = 10
    n = size(A, 1)

    @inbounds for kk in 1:bs:n            # iterate over cols of A (rows of B)
        for jj in 1:bs:n                  # iterate over cols of B (and C)
            @timeit "for i" for i in 1:n
                @timeit "for j" for j in jj:jj+bs-1
                    @timeit "get C[i,j]" s = C[i, j]
                    @timeit "scalar product" for k in kk:kk+bs-1
                        s += A[i, k] * B[k, j]
                    end
                    @timeit "store s to C[i,j]" C[i, j] = s
                end
            end
        end
    end
    # nothing returned, C updated in place
end

reset_timer!()
N = 1000
A = rand(N, N)
B = rand(N, N)
C2 = zeros(size(A));
matmul_blocked_1!(A, B, C2)
print_timer()
``````

This is what I get:

``````
 ──────────────────────────────────────────────────────────────────────────────
                                        Time                   Allocations
                                ──────────────────────   ───────────────────────
        Tot / % measured:           58.8s / 100%            4.63GiB / 100%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 for i                  10.0k    58.8s   100%  5.88ms   4.62GiB  100%    484KiB
   for j                10.0M    57.7s  98.1%  5.77μs   4.47GiB  96.8%     480B
     scalar product      100M    8.47s  14.4%  84.7ns     0.00B  0.00%    0.00B
     get C[i,j]          100M    7.43s  12.6%  74.3ns     0.00B  0.00%    0.00B
     store s to C[i,j]   100M    6.78s  11.5%  67.8ns     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────────
``````

Surprisingly (to me), reading `s = C[i,j]` and writing `C[i,j] = s` take more time than the actual computation part, the scalar product.

In these sorts of situations, besides "rethinking the algorithm", what kinds of transformations do you try (such as reordering loops)?
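For example, the only reordering I can think of myself is making the innermost loop run over `i`, since Julia arrays are column-major and `A[:, k]` and `C[:, j]` would then be walked contiguously. Something like this (my own sketch, same blocking assumed, untested for speed):

```julia
# Hypothetical jki-style reordering of the blocked loop: the innermost loop
# runs over i, so columns of A and C are accessed contiguously
# (Julia arrays are column-major). Assumes bs divides n, like the original.
function matmul_blocked_jki!(A, B, C)
    bs = 10
    n = size(A, 1)
    @inbounds for kk in 1:bs:n
        for jj in 1:bs:n
            for j in jj:jj+bs-1
                for k in kk:kk+bs-1
                    b = B[k, j]            # hoist the scalar out of the i loop
                    @simd for i in 1:n     # contiguous access to A[:, k] and C[:, j]
                        C[i, j] += A[i, k] * b
                    end
                end
            end
        end
    end
    return nothing
end
```

Is that the kind of transformation you mean?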

I should add that I'm not trying to write the best matmul implementation possible (even though having several improvements over this baseline would help me a lot in understanding how to get more performance). I am mostly interested in the "thought process" people follow; I am sometimes a bit lost when I face this type of issue. This is a good example that illustrates working with slices of arrays.

If you are on a recent version of Julia, perhaps the results are just not correct:

> This module exports a `@timeit` macro that works similarly to the `%timeit` magic in IPython.
> THIS PACKAGE IS DEPRECATED: It no longer works correctly on Julia v0.7+ due to scoping changes in Julia. Use `@btime` from BenchmarkTools.jl instead.

What module are you referring to in that quote? I don't think TimerOutputs.jl is deprecated.

Sorry - I missed the `using` in the second version of the code.
I thought you were using Timeit.jl (which exports a `@timeit` macro).


I think you are "timing" the timing, so to speak. Why else would there be 4 GiB of allocations?
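A quick way to check is to measure the kernel with the instrumentation stripped out; the whole multiply should allocate essentially nothing. A minimal sketch using only Base's `@allocated` (no extra packages; the function is just your kernel without `@timeit`):

```julia
# The same kernel with all @timeit instrumentation removed.
function matmul_plain!(A, B, C)
    bs = 10
    n = size(A, 1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n, j in jj:jj+bs-1
            s = C[i, j]
            for k in kk:kk+bs-1
                s += A[i, k] * B[k, j]
            end
            C[i, j] = s
        end
    end
    return nothing
end

N = 100
A = rand(N, N); B = rand(N, N); C = zeros(N, N)
matmul_plain!(A, B, C)          # warm up so compilation is not measured
fill!(C, 0.0)
alloc = @allocated matmul_plain!(A, B, C)
# alloc should be (close to) zero -- nowhere near 4 GiB -- which would
# confirm the allocations in the timer report come from the timing itself.
```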

Adding a `@simd` annotation if possible will help a lot.
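For the kernel in the first post, the simplest placement would be on the innermost loop, something like this (a sketch; whether the compiler can vectorize it profitably is another matter, since the `k` loop reduces into a scalar and reads `A` row-wise, i.e. strided):

```julia
# The blocked kernel with @simd on the innermost (k) loop.
function matmul_blocked_simd!(A, B, C)
    bs = 10
    n = size(A, 1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n, j in jj:jj+bs-1
            s = C[i, j]
            @simd for k in kk:kk+bs-1    # short reduction; A[i, k] is strided
                s += A[i, k] * B[k, j]
            end
            C[i, j] = s
        end
    end
    return nothing
end
```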

In this case it is twice as slow. I suspect that the loop `for k in kk:kk+bs-1` is too short for the annotation to give any sensible speedup. I tried to manually load a SIMD vector to do the dot product, but I was not able to make it work. Maybe someone could give me a hand!

Is it reading and writing, though?
I have a feeling that timing the inner operations is what actually causes the time and allocation overhead reported for the `for` blocks.

I am not sure whether I am using `@timeit` in the appropriate way. That would be an important place to start.

I would profile it. Have you tried that?

I have some issues with the default profiler (its output is hard for me to understand). I think that what I did is a reasonable way to profile the code: the % of time spent in the different parts should tell me what is worth reconsidering. What I am not sure about is whether I am using TimerOutputs correctly.

Try using ProfileView.jl; it gives a really nice display of the default profiler's output.
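Roughly like this (a sketch; `Profile.print` works with just the standard library, and `ProfileView.view()` draws the flame graph if the package is installed):

```julia
using Profile

# the untimed kernel from the first post, repeated here for completeness
function matmul_blocked_1!(A, B, C)
    bs = 10
    n = size(A, 1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n, i in 1:n, j in jj:jj+bs-1
        s = C[i, j]
        for k in kk:kk+bs-1
            s += A[i, k] * B[k, j]
        end
        C[i, j] = s
    end
end

N = 300
A = rand(N, N); B = rand(N, N); C = zeros(N, N)
matmul_blocked_1!(A, B, C)           # warm up (avoid profiling compilation)

Profile.clear()
@profile matmul_blocked_1!(A, B, C)
Profile.print(format=:flat)          # text summary; ProfileView.view() for the flame graph
```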


Also worth noting: a naive multiplication is very similar performance-wise. I tested on 400×400 matrices, where the naive version is only 10% slower.

``````julia
function naiveMult!(C, A, B)
    @inbounds for i in 1:size(A, 1)
        for j in 1:size(B, 2)
            @simd for z in 1:size(B, 1)
                C[i, j] += A[i, z] * B[z, j]
            end
        end
    end
    return C
end
``````

Your version might use SIMD; in theory, a pretty similar version of what I wrote should give me a decent speedup.

There is a course that goes into depth on how to speed up exactly this kind of code:

http://www.cs.utexas.edu/users/flame/laff/pfhp/index.html

I am just trying to "learn the moral of the story" behind the different "tricks" used in this example, such as working with smaller vectors to use the cache more efficiently.

Try changing `bs` to 16. That gave me a free factor of 2 in speed. I think the advantage comes from 16 being a multiple of the cache-line size, which can matter a lot.
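One way to explore this is to make `bs` an argument and compare a few block sizes. A sketch (the function is just the kernel from the first post with `bs` parameterized; exact timings will of course vary by machine):

```julia
# Block size as a parameter so different values can be compared.
function matmul_blocked!(A, B, C, bs)
    n = size(A, 1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n, j in jj:jj+bs-1
            s = C[i, j]
            for k in kk:kk+bs-1
                s += A[i, k] * B[k, j]
            end
            C[i, j] = s
        end
    end
end

N = 512                                     # divisible by every tested block size
A = rand(N, N); B = rand(N, N)
for bs in (8, 16, 32, 64)
    C = zeros(N, N)
    matmul_blocked!(A, B, C, bs)            # warm up
    fill!(C, 0.0)
    t = @elapsed matmul_blocked!(A, B, C, bs)
    println("bs = $bs: $(round(t * 1000, digits=1)) ms")
end
```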