This seems to suggest that there is no improvement from using more threads with BLAS on my 8-core Apple M2 (late 2023 model).
julia> using AppleAccelerate
julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libopenblas64_.dylib
├ [ LP64] Accelerate
└ [ILP64] Accelerate
julia> BLAS.set_num_threads(1) # 1 thread
julia> A = rand(2000,2000);
julia> B = rand(2000,2000);
julia> A*B; # first run
julia> @time A*B;
0.057672 seconds (3 allocations: 30.531 MiB, 8.70% gc time)
julia> @time A*B;
0.111675 seconds (3 allocations: 30.531 MiB, 54.47% gc time)
julia> @time A*B;
0.066983 seconds (3 allocations: 30.531 MiB)
julia> @time A*B;
0.062971 seconds (3 allocations: 30.531 MiB, 6.35% gc time)
julia> BLAS.set_num_threads(4) # 4 threads
julia> @time A*B;
0.063720 seconds (3 allocations: 30.531 MiB, 3.26% gc time)
julia> @time A*B;
0.064859 seconds (3 allocations: 30.531 MiB, 0.74% gc time)
julia> @time A*B;
0.063915 seconds (3 allocations: 30.531 MiB, 1.90% gc time)
Is the fact that libopenblas is still listed the problem?
No, that’s not the problem; it’s the hardware. I get similar results, and with 32 threads the performance is actually degraded.
A few things are going on:
The matrix is too small for many threads to help. A 2000×2000 dgemm is ~16 GFLOP of work. On Apple Silicon’s high-performance cores, a single core with AMX/NEON can already chew through this in ~15 ms. As you add threads, the computation-to-synchronization ratio deteriorates rapidly: the overhead of spinning up, synchronizing, and joining threads starts to dominate.
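A quick back-of-the-envelope check of that claim (the ~15 ms figure is just the rough single-core time from above):
julia> n = 2000;
julia> flops = 2 * n^3         # flop count for a dense n×n matmul (n³ multiplies + n³ adds)
16000000000
julia> flops / 0.015 / 1e9     # implied rate in GFLOP/s if one core finishes in ~15 ms
1066.6666666666667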
Accelerate already handles this well internally. After you loaded AppleAccelerate, the BLAS config shows Accelerate is registered alongside OpenBLAS. Apple’s Accelerate framework is heavily optimized for the M-series unified memory architecture and AMX coprocessor. It likely makes its own internal threading decisions that are already near-optimal for this problem size, which is why the 1-thread and 8-thread timings with Accelerate are essentially identical (~13–16ms). The BLAS.set_num_threads call may not even be fully respected by Accelerate’s internal scheduler.
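You can at least query what libblastrampoline reports back; whether Accelerate’s internal scheduler actually honors it is another question:
julia> BLAS.set_num_threads(4)
julia> BLAS.get_num_threads()   # what LBT reports; may not reflect Accelerate’s actual scheduling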
The OpenBLAS results (before loading AppleAccelerate) tell the same story, but worse. The anomalous spike at 16 threads (0.115 s with 81% GC time) is likely a pathological interaction: OpenBLAS spawns threads that compete with Julia’s GC threads for cores. At 32 threads you’re well past the physical core count, yet the timing recovers, perhaps because OpenBLAS internally caps its thread count or the scheduler settles down.
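To see how much concurrency is stacked up in one session, you can compare Julia’s own thread counts against the BLAS setting (the values depend on how Julia was launched):
julia> Threads.nthreads()       # Julia compute threads
julia> Threads.ngcthreads()     # GC threads (Julia 1.10 and later)
julia> BLAS.get_num_threads()   # BLAS threads layered on top of those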
What you’d see at larger sizes: Try n = 8000 or n = 16000. That’s where multi-threading starts to pay off meaningfully, because there’s enough work to amortize the coordination cost across cores.
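For example, a small sweep like this (sizes and thread counts are just suggestions) should show where the crossover happens:
julia> using LinearAlgebra
julia> for n in (2000, 8000), t in (1, 8)
           BLAS.set_num_threads(t)
           A = rand(n, n); B = rand(n, n)
           A * B                       # warm-up, so compilation isn’t timed
           print("n = $n, threads = $t: ")
           @time A * B
       end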
Practical upshot: On Apple Silicon with Accelerate loaded, just leave the thread count at the default. Accelerate’s AMX-aware scheduler will do the right thing. Manually overriding BLAS.set_num_threads is mostly useful on Linux with OpenBLAS/MKL on large NUMA systems.
Hmm, I see. I was trying to work up to solving a large linear system with multiple threads and Krylov.jl, but I couldn’t get any performance improvement, so I tried this minimal example. You’re right about Linux and NUMA: the same code on HPC nodes with Intel’s Sapphire Rapids architecture does make a difference:
NUMA node(s): 8
NUMA node0 CPU(s): 0-12,104-116
NUMA node1 CPU(s): 13-25,117-129
NUMA node2 CPU(s): 26-38,130-142
NUMA node3 CPU(s): 39-51,143-155
NUMA node4 CPU(s): 52-64,156-168
NUMA node5 CPU(s): 65-77,169-181
NUMA node6 CPU(s): 78-90,182-194
NUMA node7 CPU(s): 91-103,195-207
and in Julia 1.12.4:
julia> using MKL
julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so
julia> BLAS.set_num_threads(1)
julia> A = rand(2000, 2000); B = rand(2000, 2000);
julia> A*B; # first time
julia> using BenchmarkTools
julia> @btime *($A, $B); # single threaded
165.446 ms (3 allocations: 30.52 MiB)
julia> BLAS.set_num_threads(4)
julia> @btime *($A, $B); # multi-threaded
44.037 ms (3 allocations: 30.52 MiB)
Interestingly, the M2 shows similar performance, about 48 ms with @btime in my single-threaded experiment. I suppose I should target my linear system solves at the NUMA machines (where they will ultimately be used) to see what works for my use case.
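For reference, the solve I’m working toward looks roughly like this (a sketch only: the shifted 1-D Laplacian is a stand-in for my real operator, and cg assumes a symmetric positive definite matrix):
julia> using LinearAlgebra, SparseArrays, Krylov
julia> n = 10_000;
julia> A = spdiagm(-1 => fill(-1.0, n-1), 0 => fill(2.1, n), 1 => fill(-1.0, n-1));  # SPD test matrix
julia> b = rand(n);
julia> x, stats = cg(A, b);   # conjugate gradient from Krylov.jl
julia> stats.solved
true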