Get intuition on how to improve matmul code

Hello,

After playing a little bit with the most naive blocked matmul version I found on the internet, I was wondering what I could do to improve its performance.

function matmul_blocked_1!(A, B, C)
    bs = 10                                  # block size; assumed to divide n evenly
    n = size(A,1)

    @inbounds for kk in 1:bs:n               # iterate over columns of A (rows of B) in blocks
        for jj in 1:bs:n                     # iterate over columns of B (and C) in blocks

            for i in 1:n                     # pick the row slice A[i, kk:kk+bs-1]
               for j in jj:jj+bs-1           # partial dot product A[i, kk:kk+bs-1] . B[kk:kk+bs-1, j]
                   s = C[i,j]
                   for k in kk:kk+bs-1
                      s += A[i,k] * B[k,j]
                   end
                   C[i,j] = s
               end
            end
        end
    end
    return nothing                           # C is updated in place
end

I suspected that the function above could have read/write performance problems, so I timed it:

using TimerOutputs
function matmul_blocked_1!(A, B, C)
    bs = 10                                  # block size; assumed to divide n evenly
    n = size(A,1)

    @inbounds for kk in 1:bs:n               # iterate over columns of A (rows of B) in blocks
        for jj in 1:bs:n                     # iterate over columns of B (and C) in blocks

            @timeit "for i" for i in 1:n
                @timeit "for j" for j in jj:jj+bs-1
                    @timeit "get C[i,j]" s = C[i,j]
                    @timeit "scalar product" for k in kk:kk+bs-1
                        s += A[i,k] * B[k,j]
                    end
                    @timeit "store s to C[i,j]" C[i,j] = s
                end
            end
        end
    end
    return nothing                           # C is updated in place
end
reset_timer!()
N = 1000
A = rand(N,N)
B = rand(N,N)
C2 = zeros(size(A));
matmul_blocked_1!(A,B,C2)
print_timer()

This is what I get:

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            58.8s / 100%            4.63GiB / 100%     

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 for i                  10.0k    58.8s   100%  5.88ms   4.62GiB  100%    484KiB
   for j                10.0M    57.7s  98.1%  5.77μs   4.47GiB  96.8%     480B
     scalar product      100M    8.47s  14.4%  84.7ns     0.00B  0.00%    0.00B
     get C[i,j]          100M    7.43s  12.6%  74.3ns     0.00B  0.00%    0.00B
     store s to C[i,j]   100M    6.78s  11.5%  67.8ns     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────────

Surprisingly (to me), reading s = C[i,j] and writing C[i,j] = s take more time than the actual scalar-product computation.

In these sorts of situations, besides “rethinking the algorithm”, what kinds of transformations do you try (such as reordering loops)?

I should add that I’m not trying to write the best matmul implementation possible (even though having several improvements over this baseline would help me a lot in understanding how to get more performance). I am mostly interested in the “thought process” people follow; I am sometimes a bit lost when I face these kinds of issues. This is a good example that illustrates working with slices of arrays.

If you are on a recent version of Julia, perhaps the results are just not correct:

This module exports a @timeit macro that works similarly to the %timeit magic in IPython.
THIS PACKAGE IS DEPRECATED: It no longer works correctly on Julia v0.7+ due to scoping changes in Julia. Use @btime from BenchmarkTools.jl instead.

What module are you referring to in that quote? I don’t think TimerOutputs.jl is deprecated.

Sorry - I missed the using in the second version of the code.
I thought you were using Timeit.jl (which exports a @timeit macro).


I think you are “timing” the timing, so to speak. Why else would there be 4GB of allocations?
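A minimal sketch of an overhead-free measurement (not from the thread; it assumes BenchmarkTools.jl is installed): time the whole call with @btime instead of instrumenting the inner loops, since every @timeit invocation itself costs time and allocations.

using BenchmarkTools

N = 1000
A = rand(N, N); B = rand(N, N); C = zeros(N, N)
@btime matmul_blocked_1!($A, $B, $C)   # $ interpolates so global lookups aren't timed
# note: C accumulates across samples, which doesn't affect the timing itself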

Adding a @simd annotation if possible will help a lot.
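Presumably that means annotating the innermost reduction loop, e.g. (a sketch of a hypothetical variant, matmul_blocked_simd!, not code from the thread):

function matmul_blocked_simd!(A, B, C)
    bs = 10
    n = size(A,1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n, j in jj:jj+bs-1
            s = C[i,j]
            @simd for k in kk:kk+bs-1    # reduction loop; @simd permits reassociation
                s += A[i,k] * B[k,j]
            end
            C[i,j] = s
        end
    end
    return nothing
end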

In this case it is twice as slow. I suspect that the loop for k in kk:kk+bs-1 is too short for SIMD to give any noticeable speedup. I tried to manually load a SIMD vector to do the dot product, but I was not able to do it. Maybe someone could give me a hand!
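One possible route (a sketch of loop reordering, not the poster’s attempt): Julia arrays are column-major, so reading A[i,k] with i fixed and k varying is a strided access that defeats vectorization of the k loop. Making i the innermost index turns the loads from A and C into contiguous ones:

function matmul_reordered!(C, A, B)
    n = size(A, 1)
    @inbounds for j in 1:n
        for k in 1:n
            b = B[k, j]                # scalar, reused across the whole inner loop
            @simd for i in 1:n         # contiguous access to A[:,k] and C[:,j]
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end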

Is it reading and writing though?
I have a feeling that timing the inner operations is what actually causes that time and allocation overhead in the for blocks.

I am not sure whether I am using @timeit in the appropriate way. That would be an important place to start.

I would profile it. Have you tried that?

I have some issues with the default profiler (its output is hard for me to understand). I think what I did is a reasonable way to profile the code: the % of time spent in the different parts should tell me what is worth reconsidering. What I am not sure about is whether I am using TimerOutputs correctly.

Try using ProfileView.jl, it gives a really nice display of the default profiler’s output.
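A minimal sketch of that workflow, assuming the Profile standard library together with ProfileView.jl:

using Profile, ProfileView

matmul_blocked_1!(A, B, C)            # warm-up call so compilation isn't profiled
Profile.clear()
@profile matmul_blocked_1!(A, B, C)   # sample the un-instrumented version
ProfileView.view()                    # flame graph: wide bars are the hot spots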


Also worth noting: a naive multiplication is very similar performance-wise. I tested on 400×400 matrices, where the naive version is only 10% slower.

function naiveMult!(C, A, B)
    @inbounds for i in 1:size(A,1)
        for j in 1:size(B,2)
            @simd for z in 1:size(B,1)    # reduction over the shared dimension
                C[i,j] += A[i,z]*B[z,j]
            end
        end
    end
    return C
end

Your version might use SIMD; in theory, a pretty similar version to what I wrote should give me a decent speedup.
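One way to check whether the compiler actually vectorized the inner loop (a sketch; the exact IR will differ by machine):

# Vector instructions such as "fmul <4 x double>" in the output
# indicate the @simd loop really was vectorized.
@code_llvm debuginfo=:none naiveMult!(C, A, B)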

There is a course that goes into depth on how to speed up this kind of code:

http://www.cs.utexas.edu/users/flame/laff/pfhp/index.html

I am just trying to “learn the moral of the story” behind the different “tricks” used in this example, such as working with smaller vectors to use the cache more efficiently, etc.

Try changing bs to 16. That gave me a free factor of 2 in speed. I think the advantage comes from a block of 16 Float64s spanning a whole number of 64-byte cache lines, which can matter a lot.
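Note, though, that 16 does not divide 1000, so the hardcoded kk:kk+bs-1 ranges in matmul_blocked_1! would run past the array bounds under @inbounds. A sketch of a safer variant (matmul_blocked_bs! is hypothetical, not from the thread) that takes bs as an argument and clamps the block edges with min:

function matmul_blocked_bs!(A, B, C, bs)
    n = size(A, 1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n
            for j in jj:min(jj + bs - 1, n)     # clamp so partial blocks stay in bounds
                s = C[i, j]
                for k in kk:min(kk + bs - 1, n)
                    s += A[i, k] * B[k, j]
                end
                C[i, j] = s
            end
        end
    end
    return nothing
end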