Get intuition on how to improve matmul code

Hello,

After playing a little bit with the most naive blocked matmul version I found on the internet, I was wondering what I could do to improve its performance.

function matmul_blocked_1!(A, B, C)
    bs = 10                                  # block size; assumed to divide n evenly
    n = size(A,1)

    @inbounds for kk in 1:bs:n               # iterate over columns of A (rows of B) in blocks
        for jj in 1:bs:n                     # iterate over columns of B (and C) in blocks

            for i in 1:n                     # pick the row slice A[i, kk:kk+bs-1]
               for j in jj:jj+bs-1           # partial dot product A[i, kk:kk+bs-1] . B[kk:kk+bs-1, j]
                   s = C[i,j]
                   for k in kk:kk+bs-1
                      s += A[i,k] * B[k,j]
                   end
                   C[i,j] = s
               end
            end
        end
    end
    return nothing                           # C is updated in place
end

I suspected that the function above could have read/write performance problems, so I timed it:

using TimerOutputs
function matmul_blocked_1!(A, B, C)
    bs = 10                                  # block size; assumed to divide n evenly
    n = size(A,1)

    @inbounds for kk in 1:bs:n               # iterate over columns of A (rows of B) in blocks
        for jj in 1:bs:n                     # iterate over columns of B (and C) in blocks

            @timeit "for i" for i in 1:n
                @timeit "for j" for j in jj:jj+bs-1
                    @timeit "get C[i,j]" s = C[i,j]
                    @timeit "scalar product" for k in kk:kk+bs-1
                        s += A[i,k] * B[k,j]
                    end
                    @timeit "store s to C[i,j]" C[i,j] = s
                end
            end
        end
    end
    return nothing                           # C is updated in place
end
reset_timer!()
N = 1000
A = rand(N,N)
B = rand(N,N)
C2 = zeros(size(A));
matmul_blocked_1!(A,B,C2)
print_timer()

This is what I get:

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            58.8s / 100%            4.63GiB / 100%     

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 for i                  10.0k    58.8s   100%  5.88ms   4.62GiB  100%    484KiB
   for j                10.0M    57.7s  98.1%  5.77μs   4.47GiB  96.8%     480B
     scalar product      100M    8.47s  14.4%  84.7ns     0.00B  0.00%    0.00B
     get C[i,j]          100M    7.43s  12.6%  74.3ns     0.00B  0.00%    0.00B
     store s to C[i,j]   100M    6.78s  11.5%  67.8ns     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────────

Surprisingly (to me), reading s = C[i,j] and writing C[i,j] = s take more time than the actual scalar-product computation.

In these sorts of situations, besides “rethinking the algorithm”, what kinds of transformations do you try (such as reordering loops)?

I should add that I’m not trying to write the best matmul implementation possible (even though having several improvements over this baseline would help me a lot in understanding how to get more performance). I am mostly interested in the “thought process” people follow; I am sometimes a bit lost when I face these kinds of issues. This is a good example that illustrates working with slices of arrays.

If you are on a recent version of Julia, perhaps the results are just not correct:

This module exports a @timeit macro that works similarly to the %timeit magic in IPython.
THIS PACKAGE IS DEPRECATED: It no longer works correctly on Julia v0.7+ due to scoping changes in Julia. Use @btime from BenchmarkTools.jl instead.

What module are you referring to in that quote? I don’t think TimerOutputs.jl is deprecated.

Sorry - I missed the using in the second version of the code.
I thought you were using Timeit.jl (which exports a @timeit macro).


I think you are “timing” the timing, so to speak. Why else would there be 4GB of allocations?
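A minimal sketch of an overhead-free measurement (not from the thread; it assumes BenchmarkTools.jl is installed): time the whole call with @btime instead of instrumenting the inner loops, since every @timeit invocation itself costs time and allocations.

using BenchmarkTools

N = 1000
A = rand(N, N); B = rand(N, N); C = zeros(N, N)
@btime matmul_blocked_1!($A, $B, $C)   # $ interpolates so global lookups aren't timed
# note: C accumulates across samples, which doesn't affect the timing itself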

Adding a @simd annotation if possible will help a lot.
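Presumably that means annotating the innermost reduction loop, e.g. (a sketch of a hypothetical variant, matmul_blocked_simd!, not code from the thread):

function matmul_blocked_simd!(A, B, C)
    bs = 10
    n = size(A,1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n, j in jj:jj+bs-1
            s = C[i,j]
            @simd for k in kk:kk+bs-1    # reduction loop; @simd permits reassociation
                s += A[i,k] * B[k,j]
            end
            C[i,j] = s
        end
    end
    return nothing
end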

In this case it is twice as slow. I suspect that the loop for k in kk:kk+bs-1 is too short for SIMD to give any noticeable speedup. I tried to manually load a SIMD vector to do the dot product, but I was not able to do it. Maybe someone could give me a hand!
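One possible route (a sketch of loop reordering, not the poster’s attempt): Julia arrays are column-major, so reading A[i,k] with i fixed and k varying is a strided access that defeats vectorization of the k loop. Making i the innermost index turns the loads from A and C into contiguous ones:

function matmul_reordered!(C, A, B)
    n = size(A, 1)
    @inbounds for j in 1:n
        for k in 1:n
            b = B[k, j]                # scalar, reused across the whole inner loop
            @simd for i in 1:n         # contiguous access to A[:,k] and C[:,j]
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end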

Is it reading and writing though?
I have a feeling that timing the inner operations is what actually causes that time and allocation overhead in the for blocks.

I am not sure whether I am using @timeit in the appropriate way. That would be an important place to start.

I would profile it. Have you tried that?

I have some issues with the default profiler (its output is hard for me to understand). I think what I did is a reasonable way to profile the code: the % of time spent in the different parts should tell me what is worth reconsidering. What I am not sure about is whether I am using TimerOutputs correctly.

Try using ProfileView.jl, it gives a really nice display of the default profiler’s output.
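A minimal sketch of that workflow, assuming the Profile standard library together with ProfileView.jl:

using Profile, ProfileView

matmul_blocked_1!(A, B, C)            # warm-up call so compilation isn't profiled
Profile.clear()
@profile matmul_blocked_1!(A, B, C)   # sample the un-instrumented version
ProfileView.view()                    # flame graph: wide bars are the hot spots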


Also worth noting: a naive multiplication is very similar performance-wise. I tested on 400×400 matrices, where the naive version is only 10% slower.

function naiveMult!(C, A, B)
    @inbounds for i in 1:size(A,1)
        for j in 1:size(B,2)
            @simd for z in 1:size(B,1)    # reduction over the shared dimension
                C[i,j] += A[i,z]*B[z,j]
            end
        end
    end
    return C
end

Your version might use SIMD; in theory, a pretty similar version to what I wrote should give me a decent speedup.
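One way to check whether the compiler actually vectorized the inner loop (a sketch; the exact IR will differ by machine):

# Vector instructions such as "fmul <4 x double>" in the output
# indicate the @simd loop really was vectorized.
@code_llvm debuginfo=:none naiveMult!(C, A, B)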

There is a course that goes into depth on how to speed up this kind of code:

http://www.cs.utexas.edu/users/flame/laff/pfhp/index.html

I am just trying to “learn the moral of the story” behind the different “tricks” used in this example, such as working with smaller vectors to use the cache more efficiently, etc.

Try changing bs to 16. That gave me a free factor of 2 in speed. I think the advantage comes from a block of 16 Float64s spanning a whole number of 64-byte cache lines, which can matter a lot.
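Note, though, that 16 does not divide 1000, so the hardcoded kk:kk+bs-1 ranges in matmul_blocked_1! would run past the array bounds under @inbounds. A sketch of a safer variant (matmul_blocked_bs! is hypothetical, not from the thread) that takes bs as an argument and clamps the block edges with min:

function matmul_blocked_bs!(A, B, C, bs)
    n = size(A, 1)
    @inbounds for kk in 1:bs:n, jj in 1:bs:n
        for i in 1:n
            for j in jj:min(jj + bs - 1, n)     # clamp so partial blocks stay in bounds
                s = C[i, j]
                for k in kk:min(kk + bs - 1, n)
                    s += A[i, k] * B[k, j]
                end
                C[i, j] = s
            end
        end
    end
    return nothing
end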