Help me understand multi-threaded scaling for matrix multiplication

Thank you @Elrod for the valuable feedback.

Actually, MKL does not ignore the little cores: htop shows all CPUs above 90% utilization. However, as you can see, I get no performance gain from using 32 threads, so it seems I should stick to a thread count equal to the number of performance cores on my machine.

Out of curiosity, could you point me to some references to start digging more seriously into multi-threading? I had the misconception that I should be able to run at least two threads per core with a corresponding performance gain, and I would like to understand all this a little better.

Perhaps this might be useful to read: BLAS Tutorial?

Very little of a core is actually duplicated for multithreading.
Extra threads don't add more execution units, memory bandwidth, decode…

So, if your code is bottlenecked on any of these, multithreading canā€™t help.

It only helps when your code's out-of-order execution is limited by dependencies between instructions, i.e., when there simply aren't enough independent instructions within a thread for the CPU to execute many in parallel (and when this is a bigger bottleneck than everything else, like memory bandwidth).

Extra threads per core can help with that because instructions of independent threads that only have register arguments are trivially independent (instructions reading/writing to memory might still be dependent, but CPUs can speculate there pretty aggressively).

Note that the performance cores on your 13900KF have a re-order buffer with 512 entries.
Thus, a single thread has a very wide window to look for instructions to execute in parallel.
With a single thread already searching that far and wide for opportunities, there's little use for an extra thread: it's unlikely that the single thread still fails to find enough independent instructions for out-of-order execution to be the bottleneck.

The primary culprit is pointer-chasing code, which is common in some object-oriented languages like Java, but less likely in Julia.

BLAS code definitely does not look like this. It is almost always bottlenecked by execution resources (particularly for level-3 BLAS) or memory bandwidth (particularly for level-1 BLAS).
