It looks like setting the number of BLAS threads is slow because the logic that determines the BLAS vendor runs on every call and is not optimized away by the compiler:
julia> @btime BLAS.set_num_threads(3)
27.839 μs (0 allocations: 0 bytes)
But calling the OpenBLAS function directly is more than 5000x faster:
my_BLAS_set_num_threads(n) =
    ccall((:openblas_set_num_threads64_, Base.libblas_name), Cvoid, (Int32,), n)
julia> @btime my_BLAS_set_num_threads(3)
4.894 ns (0 allocations: 0 bytes)
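
One possible way to get most of that speedup without hard-coding the symbol is to probe the vendor once and keep the result in a `const`, so the branch can typically be folded at compile time. The sketch below only illustrates that idea: `probe_vendor`, `PROBE_CALLS`, `CACHED_VENDOR`, and both `set_threads_*` functions are hypothetical names, the probe is a stand-in rather than the real `BLAS.vendor()`, and the functions return the symbol they would call instead of issuing the `ccall`, so the example runs without any particular BLAS build.

```julia
# Hypothetical stand-in for the runtime vendor probe (NOT the real BLAS.vendor()).
# It counts its own invocations so the effect of caching is observable.
const PROBE_CALLS = Ref(0)

function probe_vendor()
    PROBE_CALLS[] += 1
    return :openblas64
end

# Uncached variant: the probe and the branch run on every call.
function set_threads_uncached(n::Integer)
    vendor = probe_vendor()
    if vendor === :openblas64
        # A real implementation would ccall this symbol here.
        return (:openblas_set_num_threads64_, Int32(n))
    else
        return (:unsupported, Int32(n))
    end
end

# Cached variant: the probe runs once, when the constant is defined.
# Assumption: the BLAS vendor does not change during the session.
const CACHED_VENDOR = probe_vendor()

function set_threads_cached(n::Integer)
    if CACHED_VENDOR === :openblas64
        return (:openblas_set_num_threads64_, Int32(n))
    else
        return (:unsupported, Int32(n))
    end
end

println(set_threads_cached(3))
println(set_threads_uncached(3))
println("probe calls: ", PROBE_CALLS[])
```

Whether caching is acceptable depends on whether the vendor can change after load time; if it can, the cached value would have to be refreshed.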