Actually, MKL does not ignore the little cores: htop shows all the CPUs above 90%. However, as you can see, I get no performance gain from using 32 threads, so it seems I should stick to a number of threads equal to the number of performance cores on my machine.
Out of curiosity, could you point me to some references to start digging more seriously into multi-threading? I had the misconception that I should be able to run at least two threads per core with a corresponding performance gain, and I would like to understand everything a little better.
Very little of a core is actually doubled with multithreading.
Extra threads don't add more execution units, memory bandwidth, decode...
So, if your code is bottlenecked on any of these, multithreading can't help.
It only helps when your code's out-of-order execution is limited by dependencies between instructions, i.e., when there simply aren't enough independent instructions in a thread for your CPU to execute many in parallel (and when this is a bigger problem than everything else, such as memory bandwidth).
Extra threads per core can help with that because instructions of independent threads that only have register arguments are trivially independent (instructions reading/writing to memory might still be dependent, but CPUs can speculate there pretty aggressively).
Note that the performance cores on your 13900KF have a re-order buffer with 512 entries.
Thus, a single thread has a very wide window to look for instructions to execute in parallel.
With a single thread already searching so far and wide for opportunities, there's little use for an extra thread. It's unlikely that a single thread still finds so few independent instructions that out-of-order opportunities become the bottleneck.
The primary culprit is pointer-chasing code, which can be common in some object-oriented languages like Java but is less likely in Julia.
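To make "pointer chasing" concrete, here is a minimal Python sketch of the two access patterns. The function names (`make_chain`, `chase`, `independent_sum`) are made up for illustration. It only shows the dependency structure: Python's interpreter overhead will swamp the hardware effect, so don't expect timings from this to mean much.

```python
import random

def make_chain(n, seed=0):
    """Build a random single-cycle permutation: nxt[i] is the slot visited after i."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    nxt = [0] * n
    # Link each slot to its successor in the shuffled order, wrapping around.
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def chase(nxt, start, steps):
    """Pointer chasing: the next index is only known once the previous load finishes."""
    i = start
    for _ in range(steps):
        i = nxt[i]  # this load depends on the result of the previous load
    return i

def independent_sum(values):
    """Streaming access: every element's location is known up front, so the
    loads do not depend on each other (only the cheap add is chained)."""
    total = 0
    for v in values:
        total += v
    return total
```

In `chase`, the CPU cannot start the next load until the current one returns, so even a 512-entry window finds nothing to overlap. That is the situation where a second thread per core can fill the idle slots. In `independent_sum`, the loads can be issued well ahead of time.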
BLAS code definitely does not look like this. It is almost always going to be bottlenecked by execution resources (particularly for level-3 BLAS) or memory bandwidth (particularly for level-1 BLAS).
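A rough way to see the difference between level 1 and level 3 is to count floating-point operations per byte of memory traffic. The sketch below does that for `axpy` (y = a*x + y) and for a square `gemm` (C = A*B + C). The helper names are hypothetical, and the traffic counts are idealized assumptions: 8-byte elements, and for `gemm` each matrix moving between memory and cache exactly once.

```python
def axpy_intensity(n, bytes_per_elem=8):
    """Flops per byte for y = a*x + y on vectors of length n (level 1 BLAS)."""
    flops = 2 * n                      # one multiply and one add per element
    traffic = 3 * n * bytes_per_elem   # read x, read y, write y
    return flops / traffic

def gemm_intensity(n, bytes_per_elem=8):
    """Flops per byte for C = A*B + C on n x n matrices (level 3 BLAS).

    Assumes ideal caching: A, B and C are each read once and C is written once.
    """
    flops = 2 * n**3                      # n multiply-adds per output element
    traffic = 4 * n**2 * bytes_per_elem   # read A, B, C; write C
    return flops / traffic

print(axpy_intensity(1000))  # about 0.083, independent of n
print(gemm_intensity(1000))  # 62.5, grows linearly with n
```

Under these assumptions `axpy` does so little arithmetic per byte that it waits on memory, while `gemm` has plenty of work per byte and keeps the execution units busy. Neither case leaves idle execution slots for a second thread per core to fill.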