Help me understand multi-threaded scaling for matrix multiplication

Hello,

I would like to begin by clarifying that I have very little knowledge of multithreading. I was experimenting with the following code, which uses the MKL library to perform matrix multiplications with different numbers of threads.

using LinearAlgebra, MKL

N = 10_000
A = rand(Float32, N, N)
B = rand(Float32, N, N)

function matrix_multiplication(A, B)
    return A * B
end

N_threads = [1, 2, 4, 8, 16, 32]

for threads in N_threads
    BLAS.set_num_threads(threads)
    @time matrix_multiplication(A, B)
end

I found the following times

 11.859840 seconds (2 allocations: 381.470 MiB)
  5.999158 seconds (2 allocations: 381.470 MiB, 0.66% gc time)
  2.999295 seconds (2 allocations: 381.470 MiB)
  1.522145 seconds (2 allocations: 381.470 MiB)
  1.321285 seconds (2 allocations: 381.470 MiB, 2.63% gc time)
  1.301005 seconds (2 allocations: 381.470 MiB)

which clearly do not scale with the number of threads once you go above approximately 8 (for 2 and 4 the scaling is as expected).

My versioninfo is

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900KF
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

There is clearly something I am missing due to my lack of knowledge in this field. Can anyone point me in the right direction to understand why I do not get linear scaling with the number of threads?

I'd say it scales pretty well with the number of threads (12 → 6 → 3 → 1.5).
But then the system resources probably are not up to the task. How many physical cores do you have?
Edit: Sorry, just noticed the version info. So, 16 physical cores. Hence no speedup for 16 and more threads. I'd say that is as expected…


This CPU has 8 performance and 16 efficiency cores. This means that if you use more than 8 threads, the efficiency cores also get used, and each of them does not have good performance, so for more than 8 threads I would expect only a small performance increase, which is what you show… In addition, if you use only one core it runs at a (much) higher clock frequency than if you are using all cores, due to the thermal limit; this further limits the performance scaling…
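
If you want to check this from within Julia, here is a quick sketch (it assumes the Hwloc.jl package is installed; Sys.CPU_THREADS only counts logical CPUs):

using Hwloc

println("logical CPUs:   ", Sys.CPU_THREADS)            # 32 on this machine
println("physical cores: ", Hwloc.num_physical_cores()) # 24 (8 P + 16 E)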


So there is no point in using more than one thread per core? If that is the case, why?

I understand that they should not have good performance, but it surprises me that after 8 threads the performance gain is totally negligible. I was not sure whether this was due to memory considerations or just a limitation of the CPU. My knowledge of multithreading is very limited, which is why I'm asking.

I'd say memory bandwidth.

On the other hand, while versioninfo says 32 threads, I understand that as 16 threads from 8 hyperthreaded P cores plus 16 threads from 16 non-hyperthreaded E cores. So in fact the real performance comes from the 8 P cores (hyperthreading them gives 16 threads at half performance each).

It may also mean that the memory bandwidth is only just sufficient to feed those 8 P cores efficiently, which is the same argument as in the previous answer.
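
As a rough sanity check (just a back-of-envelope sketch based on the timings in the first post, not a proper roofline analysis), you can convert the measured times into GFLOP/s and see where the curve flattens:

# A dense N×N matmul needs about 2N^3 floating point operations.
N = 10_000
flops = 2.0 * N^3

for (threads, t) in zip([1, 2, 4, 8, 16, 32],
                        [11.86, 6.00, 3.00, 1.52, 1.32, 1.30])
    println("$threads threads: $(round(flops / t / 1e9, digits = 1)) GFLOP/s")
end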

As a reference my result using a Ryzen 7950X with 16 performance cores:

julia> include("bench.jl")
 13.257550 seconds (2 allocations: 381.470 MiB)
  6.647721 seconds (2 allocations: 381.470 MiB)
  4.437216 seconds (2 allocations: 381.470 MiB)
  1.856647 seconds (2 allocations: 381.470 MiB)
  1.229089 seconds (2 allocations: 381.470 MiB)
  1.173141 seconds (2 allocations: 381.470 MiB)

As you can see, Intel has slightly better single-core performance and slightly worse multi-core performance (as expected).
Using 16 threads gives us 10.9 times the single-core performance…
With Intel we get 9.0 times the single-core performance with 16 threads, which is to be expected because Intel does not have 16 performance cores…

I slightly modified your code to get a more reproducible result:

using LinearAlgebra, MKL

N = 10000
A = rand(Float32, N, N)
B = rand(Float32, N, N)

function matrix_multiplication(A, B)
    return A * B
end

N_threads = [1, 2, 4, 8, 16, 32]

# make sure everything is compiled before we benchmark
matrix_multiplication(A, B)
GC.enable(false)

for threads in N_threads
    BLAS.set_num_threads(threads)
    @time matrix_multiplication(A, B)
end
GC.enable(true)
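
If you prefer not to hand-roll the warm-up, here is a sketch using the BenchmarkTools package instead (the $-interpolation avoids measuring global variable access):

using BenchmarkTools

for threads in N_threads
    BLAS.set_num_threads(threads)
    @btime matrix_multiplication($A, $B) samples=3 evals=1
end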

Without using MKL I get an even better result:

 12.589237 seconds (2 allocations: 381.470 MiB)
  6.365722 seconds (2 allocations: 381.470 MiB)
  3.302038 seconds (2 allocations: 381.470 MiB)
  1.718357 seconds (2 allocations: 381.470 MiB)
  1.002300 seconds (2 allocations: 381.470 MiB)
  1.057210 seconds (2 allocations: 381.470 MiB)

About 22% faster than Intel with 16 threads…

According to the data sheet the CPU reaches 4.5 GHz if all cores are active and 5.7 GHz with one core…
So in theory 16 cores should achieve 4.5/5.7*16 = 12.6 times the performance of one core. I measured 12.6 times the performance, so the theory matches my measurements within 1% error.
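
In code, the little calculation looks like this (clock numbers from the data sheet, timings from the run above):

all_core_clock = 4.5       # GHz with all 16 cores active
single_core_clock = 5.7    # GHz with one active core

expected = all_core_clock / single_core_clock * 16  # ≈ 12.6
measured = 12.589237 / 1.002300                     # 1 thread vs. 16 threads
println("expected: ", round(expected, digits = 1), "x, measured: ", round(measured, digits = 1), "x")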


Why would you think that? The scaling in the OP looked pretty good!

Yes, pretty good up to the number of fast cores (8), but above that it becomes slightly worse (as is to be expected, because then the slow cores start to participate)…

Thank you! (The modification to disable the GC is also very helpful.) So just to be clear, it does not make much sense to use more than one thread per core, since then memory bandwidth would kill the performance gain of introducing more threads, right?

For some workloads you can get up to 15% performance gain with hyper-threading, but in most cases it does not help. In the end you must benchmark your own application. In addition, more threads increase the workload on the garbage collector, so make sure to enable multi-threaded garbage collection, for example by launching Julia with:

julia --gcthreads=8,1
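
You can verify the setting from within the session (this assumes Julia ≥ 1.10, where Threads.ngcthreads is available; it counts both mark and concurrent sweep threads):

julia> Threads.ngcthreads()  # with --gcthreads=8,1 this should return 9
9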

Is it not possible to enable a multithreaded GC via an environment variable, similar to JULIA_NUM_THREADS?

Thank you for the clarifications. To be honest, if I set the number of threads to "auto" in VSCode, it always defaults to 32 threads, which is why I naively believed that I could get up to a 32× speedup when running on all of them. I thought that having 2 threads per core already meant that a single core with 2 threads could double the performance of the program (without considering hyperthreading).

The reality is that these threads share the resources of a single core, and hence it is not very beneficial to parallelize using a greater number of threads than cores in this case, right?

If you launch Julia with -t auto you also get multiple GC threads; it is like --gcthreads=n,0 where n is half of the total threads. But --gcthreads=n-1,1 is usually faster.

I prefer to launch Julia via a bash script like this one: KiteControllers.jl/bin/run_julia at main · aenarete/KiteControllers.jl · GitHub .

Aren't we mixing together two kinds of threads? There are the Julia threads and the BLAS threads. I understood the OP to mean that the goal was to use MKL with multiple threads. In that case, IIRC, the Julia threads play no role?

This is not true. Even if you use BLAS threads, garbage collection still uses the Julia garbage collector, and thus the number of GC threads is relevant. Not if you disable the garbage collector like in my last example, but in practice you cannot disable it for very long…

True.

My Intel 10980xe (18 performance cores):

julia> for threads in N_threads
           BLAS.set_num_threads(threads); GC.gc();
           @time matrix_multiplication(A, B)
       end
  8.907036 seconds (2 allocations: 381.470 MiB)
  4.678377 seconds (2 allocations: 381.470 MiB)
  2.384124 seconds (2 allocations: 381.470 MiB)
  1.224139 seconds (2 allocations: 381.470 MiB)
  0.634274 seconds (2 allocations: 381.470 MiB)
  0.571238 seconds (2 allocations: 381.470 MiB)

For performance-core-only machines, I would expect scaling to be in line with the number of physical cores.

Hyperthreads are more or less useless for BLAS.
The kernels are unrolled enough so that a single core has enough out-of-order execution to saturate the compute units; it doesn't need a second thread to find enough instruction-level parallelism.

On the contrary, a second thread per core just competes with the first for the same execution units.

MKL is also smart enough to not use more threads than the number of physical cores you have available, at least on Intel.
If you tell it to use more than that, it'll ignore you.
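
You can check what the library actually settled on; a quick probe (BLAS.get_num_threads is part of LinearAlgebra):

julia> using LinearAlgebra, MKL

julia> BLAS.set_num_threads(64)  # ask for more threads than physical cores

julia> BLAS.get_num_threads()    # may report a clamped value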

I'm guessing it also ignores the little cores, but I don't have access to a hybrid CPU. @RayleighLord could potentially confirm:

julia> using LinuxPerf, MKL

julia> size(A), size(B)
((10000, 10000), (10000, 10000))

julia> BLAS.set_num_threads(36)

julia> GC.gc(); @time @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           A * B
       end
  0.624906 seconds (35.52 k allocations: 384.297 MiB, 2.81% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.82e+10   33.2%  #  3.6 cycles per ns
┌ instructions             9.72e+10   33.4%  #  2.5 insns per cycle
│ branch-instructions      7.04e+08   33.4%  #  0.7% of insns
└ branch-misses            6.69e+05   33.4%  #  0.1% of branch insns
┌ task-clock               1.05e+10  100.0%  # 10.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              9.77e+04  100.0%
┌ L1-dcache-load-misses    9.49e+09   16.7%  # 30.4% of dcache loads
│ L1-dcache-loads          3.12e+10   16.7%
└ L1-icache-load-misses    2.39e+06   16.7%
┌ dTLB-load-misses         6.26e+06   16.7%  #  0.0% of dTLB loads
└ dTLB-loads               3.12e+10   16.7%
┌ iTLB-load-misses         3.18e+04   33.4%  # 194.1% of iTLB loads
└ iTLB-loads               1.64e+04   33.4%
┌ cache-misses             2.06e+08   33.2%  # 36.7% of cache refs
└ cache-references         5.60e+08   33.2%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Note that results were aggregated from only 18 threads.
Alternatively, you could watch a program like (b/h)top.
