Help me understand multi-threaded scaling for matrix multiplication

Hello,

I would like to begin by clarifying that I have very little knowledge of multithreading. I was experimenting with the following code, which uses the MKL library to perform matrix multiplications with different numbers of threads.

using LinearAlgebra, MKL

N = 10_000
A = rand(Float32, N, N)
B = rand(Float32, N, N)

function matrix_multiplication(A, B)
    return A * B
end

N_threads = [1, 2, 4, 8, 16, 32]

for threads in N_threads
    BLAS.set_num_threads(threads)
    @time matrix_multiplication(A, B)
end

I found the following times

 11.859840 seconds (2 allocations: 381.470 MiB)
  5.999158 seconds (2 allocations: 381.470 MiB, 0.66% gc time)
  2.999295 seconds (2 allocations: 381.470 MiB)
  1.522145 seconds (2 allocations: 381.470 MiB)
  1.321285 seconds (2 allocations: 381.470 MiB, 2.63% gc time)
  1.301005 seconds (2 allocations: 381.470 MiB)

which clearly do not scale with the number of threads once you go above approximately 8 (for 2 and 4 the scaling is as expected).

My versioninfo is

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900KF
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

There is clearly something I am missing due to my lack of knowledge in this field. Can anyone point me in the right direction to understand why I do not get linear scaling with the number of threads?

I'd say it scales pretty well with the number of threads (12 → 6 → 3 → 1.5).
But then the system resources probably are not up to the task. How many physical cores do you have?
Edit: Sorry, just noticed the version info. So, 16 physical cores. Hence no speedup for 16 and more threads. I'd say that is as expected…


This CPU has 8 performance and 16 efficiency cores. This means that if you use more than 8 threads, the efficiency cores also get used, and each of them does not have good performance, so for more than 8 threads I would expect only a small performance increase, which is what you show… In addition, if you use only one core it runs at a (much) higher clock frequency than if you are using all cores, due to the thermal limit; this further limits the performance scaling…
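
If you want to check this from within Julia, here is a quick sketch (it assumes the Hwloc.jl package is installed; Sys.CPU_THREADS only counts logical CPUs):

using Hwloc

println("logical CPUs:   ", Sys.CPU_THREADS)            # 32 on this machine
println("physical cores: ", Hwloc.num_physical_cores()) # 24 (8 P + 16 E)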


So there is no point in using more than one thread per core? If that is the case, why?

I understand that they should not have good performance, but it surprises me that after 8 threads the performance gain is totally negligible. I was not sure whether this was due to memory considerations or just a limitation of the CPU. My knowledge of multithreading is very limited, which is why I'm asking.

I'd say memory bandwidth.

On the other hand, while versioninfo says 32 threads, I understand that as 16 threads from 8 hyperthreaded P cores plus 16 threads from 16 non-hyperthreaded E cores. So in fact the real performance comes from the 8 P cores (hyperthreading them gives 16 threads at half performance each).

It may also mean that the memory bandwidth is only just sufficient to feed those 8 P cores efficiently, which is the same argument as in the previous answer.
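
As a rough sanity check (just a back-of-envelope sketch based on the timings in the first post, not a proper roofline analysis), you can convert the measured times into GFLOP/s and see where the curve flattens:

# A dense N×N matmul needs about 2N^3 floating point operations.
N = 10_000
flops = 2.0 * N^3

for (threads, t) in zip([1, 2, 4, 8, 16, 32],
                        [11.86, 6.00, 3.00, 1.52, 1.32, 1.30])
    println("$threads threads: $(round(flops / t / 1e9, digits = 1)) GFLOP/s")
end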

As a reference my result using a Ryzen 7950X with 16 performance cores:

julia> include("bench.jl")
 13.257550 seconds (2 allocations: 381.470 MiB)
  6.647721 seconds (2 allocations: 381.470 MiB)
  4.437216 seconds (2 allocations: 381.470 MiB)
  1.856647 seconds (2 allocations: 381.470 MiB)
  1.229089 seconds (2 allocations: 381.470 MiB)
  1.173141 seconds (2 allocations: 381.470 MiB)

As you can see, Intel has slightly better single-core performance and slightly worse multi-core performance (as expected).
Using 16 threads gives us 10.9 times the single-core performance…
With Intel we get 9.0 times the single-core performance with 16 threads, which is to be expected because Intel does not have 16 performance cores…

I slightly modified your code to get a more reproducible result:

using LinearAlgebra, MKL

N = 10000
A = rand(Float32, N, N)
B = rand(Float32, N, N)

function matrix_multiplication(A, B)
    return A * B
end

N_threads = [1, 2, 4, 8, 16, 32]

# make sure everything is compiled before we benchmark
matrix_multiplication(A, B)
GC.enable(false)

for threads in N_threads
    BLAS.set_num_threads(threads)
    @time matrix_multiplication(A, B)
end
GC.enable(true)
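
If you prefer not to hand-roll the warm-up, here is a sketch using the BenchmarkTools package instead (the $-interpolation avoids measuring global variable access):

using BenchmarkTools

for threads in N_threads
    BLAS.set_num_threads(threads)
    @btime matrix_multiplication($A, $B) samples=3 evals=1
end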

Without using MKL I get an even better result:

 12.589237 seconds (2 allocations: 381.470 MiB)
  6.365722 seconds (2 allocations: 381.470 MiB)
  3.302038 seconds (2 allocations: 381.470 MiB)
  1.718357 seconds (2 allocations: 381.470 MiB)
  1.002300 seconds (2 allocations: 381.470 MiB)
  1.057210 seconds (2 allocations: 381.470 MiB)

About 22% faster than Intel with 16 threads…

According to the data sheet the CPU reaches 4.5 GHz if all cores are active and 5.7 GHz with one core…
So in theory 16 cores should achieve 4.5/5.7*16 = 12.6 times the performance of one core. I measured 12.6 times the performance, so the theory matches my measurements within 1% error.
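
In code, the little calculation looks like this (clock numbers from the data sheet, timings from the run above):

all_core_clock = 4.5       # GHz with all 16 cores active
single_core_clock = 5.7    # GHz with one active core

expected = all_core_clock / single_core_clock * 16  # ≈ 12.6
measured = 12.589237 / 1.002300                     # 1 thread vs. 16 threads
println("expected: ", round(expected, digits = 1), "x, measured: ", round(measured, digits = 1), "x")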


Why would you think that? The scaling in the OP looked pretty good!

Yes, pretty good up to the number of fast cores (8), but above that it becomes slightly worse (as is to be expected, because then the slow cores start to participate)…

Thank you! (The modification to disable the GC is also very helpful.) So just to be clear, it does not make much sense to use more than one thread per core, since then memory bandwidth would kill the performance gain of introducing more threads, right?

For some workloads you can get up to 15% performance gain with hyper-threading, but in most cases it does not help. In the end you must benchmark your own application. In addition, more threads increase the workload on the garbage collector, so make sure to enable multi-threaded garbage collection, for example by launching Julia with:

julia --gcthreads=8,1
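
You can verify the setting from within the session (this assumes Julia ≥ 1.10, where Threads.ngcthreads is available; it counts both mark and concurrent sweep threads):

julia> Threads.ngcthreads()  # with --gcthreads=8,1 this should return 9
9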

Is it not possible to enable a multithreaded GC via an environment variable, similar to JULIA_NUM_THREADS?

Thank you for the clarifications. To be honest, if I set the number of threads to "auto" in VSCode, it always defaults to 32 threads, which is why I naively believed that I could get up to a 32× speedup when running on all of them. I thought that having 2 threads per core already meant that a single core with 2 threads could double the performance of the program (without considering hyperthreading).

The reality is that these threads share the resources of a single core, and hence it is not very beneficial to parallelize using a greater number of threads than cores in this case, right?

If you launch Julia with -t auto you also get multiple GC threads; it is like --gcthreads=n,0 where n is half of the total threads. But --gcthreads=n-1,1 is usually faster.

I prefer to launch Julia via a bash script like this one: KiteControllers.jl/bin/run_julia at main · aenarete/KiteControllers.jl · GitHub .

Aren't we mixing together two kinds of threads? There are the Julia threads and the BLAS threads. I understood the OP to mean that the goal was to use MKL with multiple threads. In that case, IIRC, the Julia threads play no role?

This is not true. Even if you use BLAS threads, garbage collection still uses the Julia garbage collector, and thus the number of GC threads is relevant. Not if you disable the garbage collector like in my last example, but in practice you cannot disable it for very long…

True.

My Intel 10980xe (18 performance cores):

julia> for threads in N_threads
           BLAS.set_num_threads(threads); GC.gc();
           @time matrix_multiplication(A, B)
       end
  8.907036 seconds (2 allocations: 381.470 MiB)
  4.678377 seconds (2 allocations: 381.470 MiB)
  2.384124 seconds (2 allocations: 381.470 MiB)
  1.224139 seconds (2 allocations: 381.470 MiB)
  0.634274 seconds (2 allocations: 381.470 MiB)
  0.571238 seconds (2 allocations: 381.470 MiB)

For performance-core-only machines, I would expect scaling to be in line with the number of physical cores.

Hyperthreads are more or less useless for BLAS.
The kernels are unrolled enough so that a single core has enough out-of-order execution to saturate the compute units; it doesn't need a second thread to find enough instruction-level parallelism.

On the contrary, a second thread per core just competes with the first for the same execution units.

MKL is also smart enough to not use more threads than the number of physical cores you have available, at least on Intel.
If you tell it to use more than that, it'll ignore you.
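
You can check what the library actually settled on; a quick probe (BLAS.get_num_threads is part of LinearAlgebra):

julia> using LinearAlgebra, MKL

julia> BLAS.set_num_threads(64)  # ask for more threads than physical cores

julia> BLAS.get_num_threads()    # may report a clamped value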

I'm guessing it also ignores the little cores, but I don't have access to a hybrid CPU. @RayleighLord could potentially confirm:

julia> using LinuxPerf, MKL

julia> size(A), size(B)
((10000, 10000), (10000, 10000))

julia> BLAS.set_num_threads(36)

julia> GC.gc(); @time @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           A * B
       end
  0.624906 seconds (35.52 k allocations: 384.297 MiB, 2.81% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.82e+10   33.2%  #  3.6 cycles per ns
┌ instructions             9.72e+10   33.4%  #  2.5 insns per cycle
│ branch-instructions      7.04e+08   33.4%  #  0.7% of insns
└ branch-misses            6.69e+05   33.4%  #  0.1% of branch insns
┌ task-clock               1.05e+10  100.0%  # 10.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              9.77e+04  100.0%
┌ L1-dcache-load-misses    9.49e+09   16.7%  # 30.4% of dcache loads
│ L1-dcache-loads          3.12e+10   16.7%
└ L1-icache-load-misses    2.39e+06   16.7%
┌ dTLB-load-misses         6.26e+06   16.7%  #  0.0% of dTLB loads
└ dTLB-loads               3.12e+10   16.7%
┌ iTLB-load-misses         3.18e+04   33.4%  # 194.1% of iTLB loads
└ iTLB-loads               1.64e+04   33.4%
┌ cache-misses             2.06e+08   33.2%  # 36.7% of cache refs
└ cache-references         5.60e+08   33.2%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Note that results were aggregated from only 18 threads.
Alternatively, you could watch a program like (b/h)top.
