Help me understand multi-threaded scaling for matrix multiplication

RayleighLord · April 16, 2024, 4:33pm

Thank you @Elrod for the valuable feedback.

Actually, MKL does not ignore the little cores: htop shows all CPUs above 90%. However, as you can see, I get no performance gain from this when I use 32 threads, so it seems I should stick to a number of threads equal to the number of performance cores on my machine.
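For anyone who wants to repeat this kind of comparison, here is a minimal sketch of how it could be scripted. The helper name `gemm_gflops`, the matrix size, and the thread counts are illustrative assumptions, not the exact setup used in this thread; only `BLAS.set_num_threads`, `BLAS.get_num_threads`, and `mul!` come from the standard `LinearAlgebra` library.

```julia
using LinearAlgebra

# Hypothetical helper: time C = A*B with a given number of BLAS threads
# and return the achieved GFLOPS (a dense n-by-n product costs about 2n^3 flops).
function gemm_gflops(n::Integer, nthreads::Integer; reps::Integer = 3)
    A = rand(n, n)
    B = rand(n, n)
    C = similar(A)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nthreads)
    try
        mul!(C, A, B)  # warm-up run (compilation, BLAS thread start-up)
        t = minimum(@elapsed(mul!(C, A, B)) for _ in 1:reps)
        return 2 * n^3 / t / 1e9
    finally
        BLAS.set_num_threads(old)  # restore the previous setting
    end
end

# Illustrative thread counts; adapt them to the core layout of your machine.
for nt in (1, 2, 4, 8)
    println(nt, " BLAS threads: ", round(gemm_gflops(1000, nt); digits = 1), " GFLOPS")
end
```

If the GFLOPS figure stops growing past a certain thread count, that count is probably a sensible setting for `BLAS.set_num_threads` on that machine.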

Out of curiosity, could you point me to some references to start digging more seriously into multithreading? I had the misconception that I should be able to run at least two threads per core with a corresponding performance gain, and I would like to understand all of this a little better.

ufechner7 · April 16, 2024, 5:56pm

Perhaps this might be useful to read: BLAS Tutorial?

Elrod · April 16, 2024, 6:29pm

Very little of a core is actually doubled when it runs two threads (hyperthreading).
Extra threads per core don’t add more execution units, memory bandwidth, decode…

So, if your code is bottlenecked on any of these, running more than one thread per core can’t help.

It only helps when your code’s out-of-order execution is limited by dependencies between instructions, i.e., when there simply aren’t enough independent instructions in a thread for the CPU to execute many in parallel (and when this is a bigger issue than everything else, such as memory bandwidth).

Extra threads per core can help with that because instructions of independent threads that only have register arguments are trivially independent (instructions reading/writing to memory might still be dependent, but CPUs can speculate there pretty aggressively).
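To make "dependencies between instructions" concrete, here is a small sketch of my own (the function names are hypothetical, not from any library). Both functions add up the same vector, but the first forms one long chain of dependent additions, while the second keeps four independent chains that a single core can overlap on its own, without any extra thread.

```julia
# One accumulator: every addition has to wait for the previous one to finish.
function sum_chain(x::Vector{Float64})
    s = 0.0
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Four accumulators: four independent dependency chains that
# the out-of-order engine can execute in parallel.
function sum_split(x::Vector{Float64})
    s1 = s2 = s3 = s4 = 0.0
    n = length(x)
    i = 1
    while i + 3 <= n
        s1 += x[i]
        s2 += x[i + 1]
        s3 += x[i + 2]
        s4 += x[i + 3]
        i += 4
    end
    while i <= n  # leftover elements when n is not a multiple of 4
        s1 += x[i]
        i += 1
    end
    return (s1 + s2) + (s3 + s4)
end
```

Timing both on a large vector (for example with `@elapsed`) should show whether the dependency chain is the limiting factor on a given machine. Note that the two results can differ in the last bits, because floating-point addition is not associative.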

Note that the performance cores on your 13900KF have a re-order buffer with 512 entries.
Thus, a single thread has a very wide window to look for instructions to execute in parallel.
With a single thread already searching so far and wide for opportunities, there’s little use for an extra thread. It’s unlikely that a single thread finds so few independent instructions that out-of-order execution becomes the bottleneck.

The main culprit for that kind of bottleneck is pointer-chasing code, which can be common in some object-oriented languages like Java but is less likely in Julia.
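As an illustration of what pointer chasing looks like, here is a sketch using an index array in place of real pointers (the names `chase` and `random_cycle` are made up for this example). Each load's address depends on the result of the previous load, so the core cannot start the next one early no matter how large its re-order buffer is.

```julia
using Random

# Follow `steps` links starting from index 1.
# The address of each load is the value returned by the previous load.
function chase(next::Vector{Int}, steps::Integer)
    i = 1
    for _ in 1:steps
        i = next[i]
    end
    return i
end

# Build a single random cycle through 1:n, so the traversal visits
# every element in an order the hardware cannot predict.
function random_cycle(n::Integer)
    order = randperm(n)
    next = Vector{Int}(undef, n)
    for k in 1:n
        next[order[k]] = order[mod1(k + 1, n)]
    end
    return next
end
```

With a cycle much larger than the caches, this is the kind of loop where a second thread per core has idle resources to use while the first thread waits on memory.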

BLAS code definitely does not look like this. It is almost always going to be bottlenecked by execution resources (particularly for Level 3 BLAS) or memory bandwidth (particularly for Level 1 BLAS).
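A rough back-of-the-envelope way to see the difference between the two BLAS levels is to count flops per byte of memory traffic. The sketch below assumes Float64 data and idealized traffic (each array read or written exactly once), which real implementations only approximate; the function names are hypothetical.

```julia
# y = a*x + y (Level 1): 2n flops; reads x and y, writes y -> 3n values of 8 bytes.
axpy_intensity(n) = 2n / (3n * 8)

# C = A*B (Level 3, square n-by-n): 2n^3 flops; reads A and B, writes C -> 3n^2 values of 8 bytes.
gemm_intensity(n) = 2n^3 / (3n^2 * 8)

println("axpy, n = 1200: ", axpy_intensity(1200), " flops/byte")
println("gemm, n = 1200: ", gemm_intensity(1200), " flops/byte")
```

Under these assumptions the intensity of `axpy` stays at 1/12 flops per byte regardless of size, while that of `gemm` grows like n/12, which is consistent with Level 1 routines being limited by memory bandwidth and Level 3 routines by execution units.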
