How to write multithreaded matrix multiplication

Hello! I'm studying parallel programming in Julia, so I decided to write some basic matrix multiplication without using the LinearAlgebra package, since that one is already multi-threaded.

using Base.Threads
using BenchmarkTools

function multiplyMatrices_oneThreadLoop(A::Matrix{Float64}, B::Matrix{Float64}, N::Int64)
        C = zeros(N, N)
        Threads.@threads for i in 1:N
                for j in 1:N
                        for k in 1:N
                                C[i, j]= A[i, k] * B[k, j]
                        end
                end

        end
        return C
end


function multiplyMatrices_spawnExample(A::Matrix{Float64}, B::Matrix{Float64}, N::Int64)
        C = zeros(N, N)
         @sync Threads.@spawn for i in 1:N
                for j in 1:N
                        for k in 1:N
                                C[i, j]= A[i, k] * B[k, j]
                        end
                end

        end
        return C
end



function multiplyMatrices_default(A::Matrix{Float64}, B::Matrix{Float64}, N::Int64)
        C = zeros(N,N)
        for i in 1:N
                for j in 1:N
                        for k in 1:N
                                C[i, j]= A[i, k] * B[k, j]
                        end
                end

        end
        return C
end

N = 5000
A = rand(N, N);
B = rand(N, N);
println("multi-threaded loop 1st run")
@btime multiplyMatrices_oneThreadLoop(A, B, N)
println("using sync spawn 1st run")
@btime multiplyMatrices_spawnExample(A,B,N)
println("default multiplication 1st run")
@btime multiplyMatrices_default(A, B, N)
println("multi-threaded loop 2nd run")
@btime multiplyMatrices_oneThreadLoop(A, B, N)
println("using sync spawn 2nd run")
@btime multiplyMatrices_spawnExample(A,B,N)
println("default multiplication 2nd run")
@btime multiplyMatrices_default(A, B, N)
println("multi-threaded loop 3rd run")
@btime multiplyMatrices_oneThreadLoop(A, B, N)
println("using sync spawn 3rd run")
@btime multiplyMatrices_spawnExample(A,B,N)
println("default multiplication 3rd run")
@btime multiplyMatrices_default(A, B, N)

When I run this code with julia -t 8, I see that:

  1. The Threads.@spawn version is slower than the default (serial) multiplication.
  2. All functions perform better on the first run than on repeated runs. Why does this happen?

Where did I make mistakes, and how can I fix them?

1 Like

Hi,

one mistake is to call @btime multiplyMatrices_oneThreadLoop(A, B, N) instead of @btime multiplyMatrices_oneThreadLoop($A, $B, $N). Interpolating the arguments with $ avoids benchmarking artifacts caused by non-constant global variables.
Furthermore, you use plain assignment instead of += in the innermost loop, so your code does not actually compute a matrix product.

This way of implementing a matmul is the obvious one, but it is unfortunately quite slow.

See also this comment:

This version of the code gives me the expected results, and the second and third runs are neither faster nor slower. I reduced N to get more reasonable runtimes.

 using Base.Threads
 using BenchmarkTools
 
 function multiplyMatrices_oneThreadLoop(A::Matrix{Float64}, B::Matrix{Float64}, N::Int64)
         C = zeros(N, N)
         Threads.@threads for i in 1:N
                 for j in 1:N
                         for k in 1:N
                                 C[i, j] += A[i, k] * B[k, j]
                         end
                 end
 
         end
         return C
 end
 
 
 function multiplyMatrices_spawnExample(A::Matrix{Float64}, B::Matrix{Float64}, N::Int64)
         C = zeros(N, N)
          @sync Threads.@spawn for i in 1:N
                 for j in 1:N
                         for k in 1:N
                                 C[i, j] += A[i, k] * B[k, j]
                         end
                 end
 
         end
         return C
 end
 
 
 
 function multiplyMatrices_default(A::Matrix{Float64}, B::Matrix{Float64}, N::Int64)
         C = zeros(N,N)
         for i in 1:N
                 for j in 1:N
                         for k in 1:N
                                 C[i, j] += A[i, k] * B[k, j]
                         end
                 end
 
         end
         return C
 end
 
 N = 100
 A = rand(N, N);
 B = rand(N, N);
 println("multi-threaded loop 1st run")
 @btime multiplyMatrices_oneThreadLoop($A, $B, $N)
 println("using sync spawn 1st run")
 @btime multiplyMatrices_spawnExample($A,$B,$N)
 println("default multiplication 1st run")
 @btime multiplyMatrices_default($A, $B, $N)
 println("multi-threaded loop 2nd run")
 @btime multiplyMatrices_oneThreadLoop($A, $B, $N)
 println("using sync spawn 2nd run")
 @btime multiplyMatrices_spawnExample($A,$B,$N)
 println("default multiplication 2nd run")
 @btime multiplyMatrices_default($A, $B, $N)
 println("multi-threaded loop 3rd run")
 @btime multiplyMatrices_oneThreadLoop($A, $B, $N)
 println("using sync spawn 3rd run")
 @btime multiplyMatrices_spawnExample($A,$B,$N)
 println("default multiplication 3rd run")
 @btime multiplyMatrices_default($A, $B, $N)
2 Likes

It’s extremely un-idiomatic and error-prone to pass N as a separate argument. You should use something like

M,K = size(A)
KB,N = size(B)
K == KB || throw(DimensionMismatch("matrices have incompatible dimensions"))
C = zeros(M, N)

which also supports rectangular matrices. Or, if you only want the square case, insist that all the sizes are the same. Once you know that your arrays are all of the proper sizes, adding @inbounds may also improve results.


Before multithreading, it’s important to have a good (and correct – be sure to fix the aforementioned += problem) single-threaded implementation. Your initial implementation suffers tremendously from memory locality issues. Most of your time (single or multithreaded) is spent waiting on your RAM rather than doing useful work. Even just making your loops in order of j, k, i (outermost to innermost) would probably be a big improvement.
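For instance, a minimal sketch of that reordering might look like the following (the function name matmul_jki and the dimension checks are additions for illustration, not from the original post); the innermost index i then walks down columns of A and C, which matches Julia's column-major layout:

function matmul_jki(A::Matrix{Float64}, B::Matrix{Float64})
    M, K = size(A)
    KB, N = size(B)
    K == KB || throw(DimensionMismatch("matrices have incompatible dimensions"))
    C = zeros(M, N)
    for j in 1:N              # columns of C and B (outermost)
        for k in 1:K
            b = B[k, j]       # constant within the innermost loop
            for i in 1:M      # rows of C and A (innermost, column-major friendly)
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end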

But even with that, this code will still be limited by your memory’s speed. Writing a high-performance matrix-matrix multiply is not trivial. A good introduction to modern implementations is here. A key trick is to compute the multiplication over small blocks (e.g., 4x4 in the linked example) to better utilize the cache. If you can break through the initial memory bottleneck, you’ll also see better results from C[i,j] = muladd(A[i,k], B[k,j], C[i,j]) which can combine the * and + into a single operation on most computers.
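As a rough sketch of the blocking idea (not a tuned implementation; for brevity it assumes every dimension is divisible by the block size bs, which a real version would have to handle properly):

function matmul_blocked(A::Matrix{Float64}, B::Matrix{Float64}; bs::Int = 64)
    M, K = size(A)
    KB, N = size(B)
    K == KB || throw(DimensionMismatch("matrices have incompatible dimensions"))
    (M % bs == 0 && K % bs == 0 && N % bs == 0) ||
        throw(ArgumentError("this sketch assumes dimensions divisible by bs"))
    C = zeros(M, N)
    @inbounds for jb in 1:bs:N, kb in 1:bs:K, ib in 1:bs:M   # iterate over blocks
        for j in jb:jb+bs-1, k in kb:kb+bs-1
            b = B[k, j]
            for i in ib:ib+bs-1                              # contiguous in memory
                C[i, j] = muladd(A[i, k], b, C[i, j])        # fused multiply-add
            end
        end
    end
    return C
end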

Once you have a good serial calculation, you can try to parallelize to get better performance.


If your goal is merely to explore parallelism (not matrix multiplication in particular), I’d suggest you try a different application like sorting. There’s a great example of multithreaded sorting in Julia in this blog post.
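To give a flavour of that approach (this is not the blog post's code, just a sketch with hypothetical helper names psort and merge_sorted): spawn a task to sort one half, sort the other half on the current task, then merge:

function psort(v::AbstractVector; cutoff::Int = 10_000)
    length(v) <= cutoff && return sort(v)            # serial base case
    mid = length(v) ÷ 2
    left = Threads.@spawn psort(v[1:mid]; cutoff)    # sort first half on another task
    right = psort(v[mid+1:end]; cutoff)              # sort second half here
    return merge_sorted(fetch(left), right)
end

# Merge two already-sorted vectors into one sorted vector.
function merge_sorted(a::AbstractVector, b::AbstractVector)
    out = similar(a, length(a) + length(b))
    i = j = 1
    for k in eachindex(out)
        if j > length(b) || (i <= length(a) && a[i] <= b[j])
            out[k] = a[i]; i += 1
        else
            out[k] = b[j]; j += 1
        end
    end
    return out
end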

5 Likes

At first I thought the same, but in fact that's not the case: threading with 24 cores gives roughly a 10x speedup. Of course, it can still be optimized much further compared to Julia's *, but the speed-up is there:

julia> using Chairmarks

julia> function matmul(A, B)
           C = similar(A, size(A, 1), size(B, 2)) 
           fill!(C, 0)
           for i in 1:size(A, 1)
               for j in 1:size(B, 2)
                   for k in 1:size(A, 2)
                       C[i, j] += A[i, k] * B[k, j]
                   end
               end
           end 
           return C
       end
matmul (generic function with 1 method)

julia> function matmul_threaded(A, B)
           C = similar(A, size(A, 1), size(B, 2)) 
           fill!(C, 0)
           Threads.@threads for i in 1:size(A, 1)
               for j in 1:size(B, 2)
                   for k in 1:size(A, 2)
                       C[i, j] += A[i, k] * B[k, j]
                   end
               end
           end 
           return C
       end
matmul_threaded (generic function with 1 method)

julia> A = rand(1000, 700);

julia> B = rand(700, 1200);

julia> A * B ≈ matmul(A, B) ≈ matmul_threaded(A, B)
true

julia> @b matmul($A, $B)
615.996 ms (2 allocs: 9.155 MiB, without a warmup)

julia> @b matmul_threaded($A, $B)
45.856 ms (123 allocs: 9.168 MiB)

julia> @b $A * $B
4.209 ms (2 allocs: 9.155 MiB)

julia> Threads.nthreads()
24

julia> A = rand(2000, 3000);

julia> B = rand(3000, 4000);

julia> A * B ≈ matmul(A, B) ≈ matmul_threaded(A, B)
true

julia> @b matmul($A, $B)
43.364 s (2 allocs: 61.035 MiB, 0.08% gc time, without a warmup)

julia> @b matmul_threaded($A, $B)
3.911 s (123 allocs: 61.048 MiB, without a warmup)

julia> @b $A * $B
111.380 ms (2 allocs: 61.035 MiB)


1 Like

You should also use @inbounds, once the sizes have been checked so that it is safe.
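For instance, applied to the matmul_threaded function above (just a sketch; the added size check is what makes skipping bounds checks safe here):

function matmul_threaded_inbounds(A, B)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    C = similar(A, size(A, 1), size(B, 2))
    fill!(C, 0)
    Threads.@threads for i in 1:size(A, 1)
        for j in 1:size(B, 2)
            @inbounds for k in 1:size(A, 2)   # bounds checks skipped; safe after the size check above
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
    return C
end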