We can write an optimized BLAS library in pure Julia (please skip OP and jump to post 4)

Hello, I was brought here by the Nim v1.0 post. As someone who also wrote a BLAS from scratch in Nim (relevant details here: Version 1.0 released of Nim Programming Language - #47 by mratsim), here are some notes:

  • Packing is a very important performance enabler: according to libxsmm, for inputs larger than 128x128, without it you cannot feed the SIMD units fast enough because the data will not be contiguous (a minimal packing sketch follows this list).
  • Prefetching helped me close the last 15% of the perf gap with OpenBLAS/MKL.
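
For illustration, here is a minimal packing sketch in Julia. The names (`pack_A!`, `MR`) and the tile size are my own assumptions, not code from any actual BLAS; it only shows the reordering idea: copy a panel of A into a contiguous buffer laid out in MR-row micro-panels so the micro-kernel can stream through it linearly.

```julia
const MR = 8  # micro-kernel row tile; a hardware-dependent assumption

# Copy an mc×kc panel of A (starting at row i0, column k0) into a
# contiguous buffer, reordered into MR-row micro-panels. Edge rows
# beyond mc are zero-padded so the micro-kernel never branches.
function pack_A!(buf::AbstractVector{Float64}, A::AbstractMatrix{Float64},
                 i0::Integer, k0::Integer, mc::Integer, kc::Integer)
    idx = 1
    @inbounds for i in 0:MR:mc-1        # one micro-panel per MR rows
        for k in 0:kc-1                 # walk the panel column by column
            for ii in 0:MR-1            # MR row values become contiguous
                r = i + ii
                buf[idx] = r < mc ? A[i0 + r, k0 + k] : 0.0
                idx += 1
            end
        end
    end
    return buf
end
```

The buffer must hold `cld(mc, MR) * MR * kc` elements; after packing, the micro-kernel reads `buf` with stride 1 regardless of the original layout of A.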

Now, on automatic cache-size computation: it was also a failed experiment for me, but:

  • The Goto and BLIS papers recommend using half of the L1 cache for a panel of one packed matrix and half of the L2 cache for another; that way the remaining half of each cache stays available for data movement, details here. (A block-size sketch follows this list.)
  • On a typical PC, you share the core with plenty of applications that compete with you for cache.
  • Hyperthreading will mess with your algorithm: if 2 sibling threads try to load different panels, they will fill the whole L1/L2, and that data will have to be discarded on the next data movement.
  • BLAS implementations are all about optimizing memory bandwidth so that it can keep up with CPU compute, but hyperthreading doubles the memory-bandwidth requirements, since both sibling threads must be fed.
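
As a concrete reading of that half-cache rule, here is a back-of-the-envelope sketch in Julia. The tile sizes `MR`/`NR` and the function name are assumptions for illustration; real implementations also have to worry about associativity and TLB effects, which is part of why the automatic approach fails in practice.

```julia
const MR = 8   # micro-kernel row tile (assumption)
const NR = 4   # micro-kernel column tile (assumption)

# Derive GEMM blocking parameters from cache sizes, following the
# half-cache rule of thumb: a kc×NR micro-panel of packed B occupies
# half of L1, and the mc×kc packed panel of A occupies half of L2.
function block_sizes(l1_bytes::Integer, l2_bytes::Integer; T::Type = Float64)
    elsize = sizeof(T)
    kc = (l1_bytes ÷ 2) ÷ (NR * elsize)   # half of L1 for the B micro-panel
    mc = (l2_bytes ÷ 2) ÷ (kc * elsize)   # half of L2 for the A panel
    mc -= mc % MR                         # round down to a multiple of MR
    return (mc = mc, kc = kc)
end

block_sizes(32 * 1024, 512 * 1024)  # 32 KiB L1d, 512 KiB L2 → (mc = 64, kc = 512)
```

Even when these numbers are computed correctly, the bullets above explain why they misfire on a desktop: the caches are shared with other processes and, with hyperthreading, with a sibling thread.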

In conclusion, you might see vastly different results on a server with Hyperthreading disabled.
Note that I’m not saying hyperthreading is bad, but GEMM is an algorithm for which there are known ways to fully occupy the CPU pipeline, so you don’t need hardware concurrency.
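
To make that last point concrete, here is a minimal register-blocking sketch in Julia (names and the 2×2 tile size are illustrative assumptions; real kernels use larger tiles and explicit SIMD). Accumulating a small tile of C in local variables gives the CPU several independent FMA chains, which is enough to hide instruction latency from a single thread.

```julia
# Minimal 2×2 register-blocked micro-kernel sketch. The four
# independent accumulators form parallel muladd (FMA) chains, so one
# thread can keep the floating-point pipeline busy on its own.
function micro2x2!(C::AbstractMatrix{Float64}, A::AbstractMatrix{Float64},
                   B::AbstractMatrix{Float64}, i::Int, j::Int, kc::Int)
    c11 = c12 = c21 = c22 = 0.0
    @inbounds for k in 1:kc
        a1 = A[i, k];    a2 = A[i+1, k]
        b1 = B[k, j];    b2 = B[k, j+1]
        c11 = muladd(a1, b1, c11);  c12 = muladd(a1, b2, c12)
        c21 = muladd(a2, b1, c21);  c22 = muladd(a2, b2, c22)
    end
    @inbounds begin
        C[i, j]   += c11;  C[i, j+1]   += c12
        C[i+1, j] += c21;  C[i+1, j+1] += c22
    end
    return C
end
```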
