Hello, I was brought here by the Nim v1.0 post. As someone who also wrote a BLAS from scratch in Nim (relevant details here: Version 1.0 released of Nim Programming Language - #47 by mratsim), here are some notes:
- Packing is a very important performance enabler (according to libxsmm, for inputs larger than 128x128). Without it you can't feed the SIMD units fast enough because the data will not be contiguous.
- Prefetching helped me close the last 15% of the performance gap with OpenBLAS/MKL.
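To make the packing point concrete, here is a minimal sketch in Python (for readability, not speed) of copying a block of A into contiguous micro-panels, in the spirit of the Goto/BLIS approach. The function name `pack_a_block` and the parameters `mc`, `kc`, `mr` are illustrative assumptions, not taken from any particular library.

```python
def pack_a_block(a, row0, col0, mc, kc, mr):
    """Pack the mc x kc block of `a` starting at (row0, col0).

    `a` is a list of rows (row-major). The result is a flat list made of
    ceil(mc / mr) micro-panels. Inside each micro-panel the mr values that
    belong to one k index sit next to each other, so a kernel could fetch
    them with a single contiguous vector load.
    """
    packed = []
    for i0 in range(0, mc, mr):            # one micro-panel per mr rows
        for k in range(kc):                # walk along the shared dimension
            for i in range(i0, i0 + mr):   # mr adjacent values for this k
                if i < mc:
                    packed.append(a[row0 + i][col0 + k])
                else:
                    packed.append(0.0)     # zero-pad the ragged last panel
    return packed


a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(pack_a_block(a, 0, 0, mc=3, kc=2, mr=2))
# [1, 4, 2, 5, 7, 0.0, 8, 0.0]
```

The inner kernel then reads `packed` strictly front to back, which is what lets the hardware prefetcher and the SIMD loads keep up.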
As for automatic cache size computation, it was also a failed experiment for me, but:
- The Goto and BLIS papers recommend using half of the L1 cache for a panel of one packed matrix and half of the L2 cache for a panel of the other; this way the other half can be used for data movement, details here.
- On a typical PC, you share the core with plenty of applications that compete with you for cache
- Hyperthreading will disrupt your algorithm if 2 sibling threads try to load different panels: together they will fill the whole L1/L2, and that data will have to be discarded on the next data movement.
- The BLAS implementations are all about optimizing memory-bandwidth so that it can keep up with CPU compute, but hyperthreading will double the memory bandwidth requirements to feed both HT cores.
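As a rough illustration of the half-cache rule and of why sibling threads hurt, here is a sketch of how block sizes could be derived. It assumes the micro-panel of B (`kc x nr`) is the one kept in L1 and the packed block of A (`mc x kc`) is the one kept in L2. The function `block_sizes`, the `threads_per_core` knob, and the example cache and tile sizes are all hypothetical.

```python
def block_sizes(l1_bytes, l2_bytes, elem_bytes, mr, nr, threads_per_core=1):
    """Estimate (mc, kc) from cache sizes using the half-cache heuristic.

    Half of each cache is budgeted for packed data and the other half is
    left for data movement. `threads_per_core` splits that budget between
    sibling hardware threads that share the same L1/L2.
    """
    l1_budget = l1_bytes // 2 // threads_per_core
    l2_budget = l2_bytes // 2 // threads_per_core

    # kc x nr micro-panel of B must fit in the L1 budget.
    kc = l1_budget // (nr * elem_bytes)
    if kc < 1:
        raise ValueError("L1 budget too small for even one row of the panel")

    # mc x kc packed block of A must fit in the L2 budget.
    mc = l2_budget // (kc * elem_bytes)
    mc -= mc % mr                          # keep mc a multiple of mr
    return mc, kc


# 32 KiB L1, 256 KiB L2, float64, hypothetical 6x8 micro-kernel.
print(block_sizes(32 * 1024, 256 * 1024, 8, mr=6, nr=8))
# (60, 256)
print(block_sizes(32 * 1024, 256 * 1024, 8, mr=6, nr=8, threads_per_core=2))
# (60, 128)
```

With two sibling threads the panel depth `kc` is halved in this model, so each packed panel is reused for fewer multiply-adds before it has to be reloaded.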
In conclusion, you might get vastly different results on a server with Hyperthreading disabled.
Note that I’m not saying hyperthreading is bad, but GEMM is an algorithm where there are known ways to fully occupy the CPU pipeline so you don’t need hardware concurrency.