Julia Threads vs BLAS threads

It looks like setting the number of BLAS threads is slow because the logic that determines the BLAS vendor runs at call time rather than being optimized away by the compiler:

julia> @btime BLAS.set_num_threads(3)
  27.839 μs (0 allocations: 0 bytes)

By contrast, calling the OpenBLAS routine directly is more than 5000x faster:

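# Calls the OpenBLAS routine directly, skipping the vendor lookup.
# The symbol name is hard-coded, so this only works with a 64-bit-interface OpenBLAS build.
my_BLAS_set_num_threads(n) =
   ccall((:openblas_set_num_threads64_, Base.libblas_name), Cvoid, (Cint,), n)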

julia> @btime my_BLAS_set_num_threads(3)
  4.894 ns (0 allocations: 0 bytes)
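
If you want to check the per-call cost on your own machine without installing BenchmarkTools, something like the following rough sketch should do. The helper name `mean_call_time` and the repetition count are my own choices for illustration, not part of any library, and the numbers it reports will be less precise than what `@btime` gives.

```julia
using LinearAlgebra

# Hypothetical helper: average wall-clock time of calling f(n), in seconds.
function mean_call_time(f, n; reps = 10_000)
    f(n)  # warm-up call so that compilation is not included in the timing
    t = @elapsed for _ in 1:reps
        f(n)
    end
    return t / reps
end

t = mean_call_time(BLAS.set_num_threads, 3)
println("BLAS.set_num_threads: ", round(t * 1e9; digits = 1), " ns per call")
```

Passing a direct `ccall` wrapper such as `my_BLAS_set_num_threads` as `f` gives the other side of the comparison, assuming the symbol it calls exists in your BLAS build.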