I have some code that uses only 4x4 matrices and does nothing multi-threaded, so I was surprised to see CPU usage go up on all 4 cores of my machine (with a large percentage of system time) when running it with 1.6.0RC1 on Arch Linux. After some confused digging around (and disabling various parts, such as the GC), this turned out to be caused, apparently, by BLAS. One indication was that the symbol `blas_thread_server` appeared above all Julia-related calls in a `perf` trace. I also saw millions of `sched_yield()` calls with `strace` over a runtime of about a minute, which doesn’t make much sense for single-threaded code.
Calling `BLAS.set_num_threads(1)` apparently restricts execution to a single core, and the run was also almost 50% faster. With 1.5.3 I don’t see this auto-threading behaviour, so perhaps some BLAS-related setting changed in 1.6?
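For reference, here is a minimal sketch of the workaround. It assumes Julia 1.6 or later, since `BLAS.get_num_threads()` is not available in 1.5.x, and the 4x4 product is only a stand-in for my actual code. Whether such a small product really reaches BLAS may depend on the Julia version, so treat it as an illustration rather than a benchmark.

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread for the rest of the session.
BLAS.set_num_threads(1)

# Query the setting to confirm it took effect (Julia 1.6+ only).
nthreads_blas = BLAS.get_num_threads()
println("BLAS threads: ", nthreads_blas)

# Stand-in workload: an in-place product of two small dense matrices.
A = rand(4, 4)
B = rand(4, 4)
C = similar(A)
mul!(C, A, B)
```

When the bundled BLAS is OpenBLAS, setting the environment variable `OPENBLAS_NUM_THREADS=1` before starting Julia should have a similar effect without touching the code.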
More generally, I can understand that such BLAS optimizations make sense when working with large matrices. But for small sizes like 4x4, isn’t there quite a bit of overhead in the parts of BLAS that decide whether to apply such threading optimizations?
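One rough way to get a feeling for that overhead is to time the library product against a plain loop that never calls BLAS. The sketch below is only that: `naive_mul!` and `time_calls` are hypothetical helpers I made up for the comparison, the iteration count is arbitrary, and the numbers will vary with machine, Julia version and BLAS thread setting.

```julia
using LinearAlgebra

# Hypothetical helper: plain triple-loop product that bypasses BLAS entirely.
function naive_mul!(C, A, B)
    m, k = size(A)
    n = size(B, 2)
    @inbounds for j in 1:n, i in 1:m
        acc = zero(eltype(C))
        for l in 1:k
            acc += A[i, l] * B[l, j]
        end
        C[i, j] = acc
    end
    return C
end

# Hypothetical helper: time n calls of an in-place product f!(C, A, B).
function time_calls(f!, C, A, B, n)
    f!(C, A, B)  # warm-up call so compilation is not part of the timing
    return @elapsed for _ in 1:n
        f!(C, A, B)
    end
end

A = rand(4, 4)
B = rand(4, 4)
C_naive = similar(A)
C_lib = similar(A)

n = 100_000
t_naive = time_calls(naive_mul!, C_naive, A, B, n)
t_lib = time_calls(mul!, C_lib, A, B, n)

println("naive loop: ", t_naive, " s for ", n, " products")
println("mul!:       ", t_lib, " s for ", n, " products")
```

Running this once with the default thread count and once after `BLAS.set_num_threads(1)` would show how much of the difference comes from threading rather than from the call into the library itself.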