Operations on small matrices and BLAS in 1.6.0 RC1

paulmelis · March 2, 2021, 3:56pm

I have some code that uses only 4x4 matrices and does nothing multi-threaded, so I was surprised to see CPU usage go up to all 4 cores on my machine (with a large percentage of system time) when running it with 1.6.0 RC1 on Arch Linux. After some confused digging around (and disabling various parts like the GC), this turned out to be apparently caused by BLAS. One indication was that the symbol blas_thread_server appeared above all Julia-related calls in a perf trace. I also saw millions of sched_yield() calls with strace over a runtime of about a minute, which doesn’t make much sense for single-threaded code.

Setting BLAS.set_num_threads(1) apparently restricts execution to a single core, and the code also ran almost 50% faster. With 1.5.3 I don’t see this auto-threading behaviour, so perhaps some BLAS-related setting changed in 1.6?
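For reference, a minimal sketch of the workaround, assuming the default BLAS library that ships with Julia. The matrices A and B are just placeholders, and BLAS.get_num_threads() is only available from Julia 1.6 onward:

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread for this session.
BLAS.set_num_threads(1)

# Query the current setting (available from Julia 1.6 onward).
nthreads_blas = BLAS.get_num_threads()
println("BLAS threads: ", nthreads_blas)

# Placeholder 4x4 matrices; a product of plain Float64 arrays like this
# can be dispatched to BLAS.
A = rand(4, 4)
B = rand(4, 4)
C = A * B
```

The call only affects the current Julia session, so it has to be made again in every new session or script.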

More generally, I can understand such BLAS optimizations make sense when working with large matrices. But for small sizes like 4x4 isn’t there quite a bit of overhead in the parts of BLAS that determine whether to do such threading optimizations?

stevengj · March 2, 2021, 4:02pm

PS. I would definitely recommend using StaticArrays for this sort of thing.
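A rough sketch of what that could look like, assuming the StaticArrays package is installed (it is not part of the standard library). The names A, B and v are only for illustration:

```julia
using StaticArrays  # third-party package, install with: ] add StaticArrays

# Fixed-size 4x4 matrices whose dimensions are part of the type.
A = @SMatrix rand(4, 4)
B = @SMatrix rand(4, 4)

# Multiplication uses StaticArrays' own methods specialized on the
# static size, and the result is again a static matrix.
C = A * B

# Fixed-size vectors work the same way.
v = SVector(1.0, 2.0, 3.0, 4.0)
w = A * v
```

Because the sizes are known at compile time, these products are handled by size-specialized Julia code, so the BLAS threading described above should not come into play for them.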

paulmelis · March 2, 2021, 4:06pm

Right, although that’s a conclusion I’ve only come to in the last hour myself. It was a bit unexpected to see the full might of BLAS being applied to such small matrices, although the manual states clearly that it is used. I just wasn’t expecting it to be used for all matrix sizes.
