Operations on small matrices and BLAS in 1.6.0 RC1

With some code that uses only 4x4 matrices and does nothing multi-threaded, I was surprised to see CPU usage go up to all 4 cores on my machine (with a large percentage of system time) when using 1.6.0 RC1 on Arch Linux. After some confused digging around (and disabling various parts like the GC), this turned out to be caused, apparently, by BLAS. One indication was that the symbol blas_thread_server appeared above all Julia-related calls in a perf trace. I also saw millions of sched_yield() calls with strace over a runtime of about a minute, which doesn’t make much sense for single-threaded code.

Calling BLAS.set_num_threads(1) apparently restricts execution to a single core, and the code also ran almost 50% faster. With 1.5.3 I don’t see this auto-threading behaviour, so perhaps some BLAS-related setting changed in 1.6?
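For anyone who wants to try the same workaround, here is a minimal sketch of what I mean. The matrices A and B are just random placeholders, not my actual code, and as far as I know BLAS.get_num_threads() is only available from 1.6 onwards.

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread before doing any matrix work.
BLAS.set_num_threads(1)

# Confirm the setting took effect (get_num_threads assumed available, Julia 1.6+).
nthreads = BLAS.get_num_threads()

# Placeholder 4x4 matrices standing in for the real workload.
A = rand(4, 4)
B = rand(4, 4)

# Ordinary Matrix{Float64} multiplication, now limited to one BLAS thread.
C = A * B
```

The call only needs to happen once per session, before the hot loop starts.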

More generally, I can understand that such BLAS optimizations make sense when working with large matrices. But for small sizes like 4x4, isn’t there quite a bit of overhead in the parts of BLAS that determine whether to do such threading optimizations?


PS. I would definitely recommend using StaticArrays for this sort of thing.
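A rough sketch of what that could look like, assuming the StaticArrays package is installed; the matrices here are random placeholders rather than anything from the original code.

```julia
using StaticArrays

# Fixed-size 4x4 matrices; the dimensions are part of the type.
A = @SMatrix rand(4, 4)
B = @SMatrix rand(4, 4)

# The product is computed by StaticArrays' own Julia code for this fixed size
# and is itself a static 4x4 matrix.
C = A * B
```

Since the size is known at compile time, small products like this should not need the general-purpose BLAS path at all.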

Right, although that’s a conclusion I’ve only come to in the last hour myself :) It was a bit unexpected to see the full might of BLAS being applied to such small matrices, although the manual states clearly that it is used. I just wasn’t expecting it to be used for all matrix sizes.