With some code that uses only 4x4 matrices and does nothing multi-threaded, I was surprised to see CPU usage go up across all 4 cores on my machine (with a large percentage of system time) when using 1.6.0RC1 on Arch Linux. After some confused digging around (and disabling various parts like the GC), this turned out to be apparently caused by BLAS. One indication was that the symbol `blas_thread_server` appeared above all Julia-related calls in a `perf` trace. Plus, I saw millions of `sched_yield()` calls with `strace` over a runtime of about a minute, which doesn’t make much sense for single-threaded code.
Setting `BLAS.set_num_threads(1)` apparently restricts execution to a single core, and it also made the run almost 50% faster. With 1.5.3 I don’t see this auto-threading behaviour, so perhaps some BLAS-related setting changed in 1.6?
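For reference, here is a minimal sketch of the workaround, not my actual code. The helper `multiply_many!` and the iteration count are made up for illustration; the only parts taken from the above are the 4x4 size and the `BLAS.set_num_threads(1)` call. I'm assuming Float64 matrices, so that the multiplication is handed to BLAS.

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread before doing any small-matrix work.
BLAS.set_num_threads(1)

# Hypothetical stand-in for the real workload: repeated 4x4 products.
function multiply_many!(C, A, B, n)
    for _ in 1:n
        mul!(C, A, B)   # in-place product, writes A*B into C
    end
    return C
end

A = rand(4, 4)
B = rand(4, 4)
C = similar(A)

multiply_many!(C, A, B, 10)            # warm-up, so compilation is not timed
@time multiply_many!(C, A, B, 100_000)
```

Running this with and without the `BLAS.set_num_threads(1)` line while watching `top` should show whether the extra cores get busy. If the BLAS in use is OpenBLAS, setting the environment variable `OPENBLAS_NUM_THREADS=1` before starting Julia should have a similar effect, though I haven't verified that on this setup.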
More generally, I can understand that such BLAS optimizations make sense when working with large matrices. But for small sizes like 4x4, isn’t there quite a bit of overhead in the parts of BLAS that determine whether to do such threading optimizations?