This seems to suggest that there is no improvement from using more threads with BLAS on my 8-core Apple M2 (late 2023 model).
julia> using AppleAccelerate
julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libopenblas64_.dylib
├ [ LP64] Accelerate
└ [ILP64] Accelerate
julia> BLAS.set_num_threads(1) # 1 thread
julia> A = rand(2000,2000);
julia> B = rand(2000,2000);
julia> A*B; # first run
julia> @time A*B;
0.057672 seconds (3 allocations: 30.531 MiB, 8.70% gc time)
julia> @time A*B;
0.111675 seconds (3 allocations: 30.531 MiB, 54.47% gc time)
julia> @time A*B;
0.066983 seconds (3 allocations: 30.531 MiB)
julia> @time A*B;
0.062971 seconds (3 allocations: 30.531 MiB, 6.35% gc time)
julia> BLAS.set_num_threads(4) # 4 threads
julia> @time A*B;
0.063720 seconds (3 allocations: 30.531 MiB, 3.26% gc time)
julia> @time A*B;
0.064859 seconds (3 allocations: 30.531 MiB, 0.74% gc time)
julia> @time A*B;
0.063915 seconds (3 allocations: 30.531 MiB, 1.90% gc time)
Is the fact that libopenblas is still listed the problem?
No, that’s not the problem; it’s the hardware. I get similar results, and with 32 threads the performance is actually degraded.
A few things are going on:
The matrix is too small for many threads to help. A 2000×2000 dgemm is ~16 GFLOP of work. On Apple Silicon’s high-performance cores, a single core with AMX/NEON can already chew through this in ~15 ms. As you add threads, the computation-to-synchronization ratio deteriorates rapidly: the overhead of spinning up, synchronizing, and joining threads starts to dominate.
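A quick back-of-the-envelope check of that claim (the ~15 ms figure is just the rough single-core time from above):
julia> n = 2000;
julia> flops = 2 * n^3         # flop count for a dense n×n matmul (n³ multiplies + n³ adds)
16000000000
julia> flops / 0.015 / 1e9     # implied rate in GFLOP/s if one core finishes in ~15 ms
1066.6666666666667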
Accelerate already handles this well internally. After you loaded AppleAccelerate, the BLAS config shows Accelerate is registered alongside OpenBLAS. Apple’s Accelerate framework is heavily optimized for the M-series unified memory architecture and AMX coprocessor. It likely makes its own internal threading decisions that are already near-optimal for this problem size, which is why the 1-thread and 8-thread timings with Accelerate are essentially identical (~13–16ms). The BLAS.set_num_threads call may not even be fully respected by Accelerate’s internal scheduler.
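You can at least query what libblastrampoline reports back; whether Accelerate’s internal scheduler actually honors it is another question:
julia> BLAS.set_num_threads(4)
julia> BLAS.get_num_threads()   # what LBT reports; may not reflect Accelerate’s actual scheduling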
The OpenBLAS results (before loading AppleAccelerate) tell the same story, but worse. The anomalous spike at 16 threads (0.115 s with 81% GC time) is likely a pathological interaction: OpenBLAS spawns threads that compete with Julia’s GC threads for cores. At 32 threads you’re well past the physical core count, yet the timing recovers, perhaps because OpenBLAS internally caps its thread count or the scheduler settles down.
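To see how much concurrency is stacked up in one session, you can compare Julia’s own thread counts against the BLAS setting (the values depend on how Julia was launched):
julia> Threads.nthreads()       # Julia compute threads
julia> Threads.ngcthreads()     # GC threads (Julia 1.10 and later)
julia> BLAS.get_num_threads()   # BLAS threads layered on top of those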
What you’d see at larger sizes: Try n = 8000 or n = 16000. That’s where multi-threading starts to pay off meaningfully, because there’s enough work to amortize the coordination cost across cores.
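For example, a small sweep like this (sizes and thread counts are just suggestions) should show where the crossover happens:
julia> using LinearAlgebra
julia> for n in (2000, 8000), t in (1, 8)
           BLAS.set_num_threads(t)
           A = rand(n, n); B = rand(n, n)
           A * B                       # warm-up, so compilation isn’t timed
           print("n = $n, threads = $t: ")
           @time A * B
       end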
Practical upshot: On Apple Silicon with Accelerate loaded, just leave the thread count at the default. Accelerate’s AMX-aware scheduler will do the right thing. Manually overriding BLAS.set_num_threads is mostly useful on Linux with OpenBLAS/MKL on large NUMA systems.
Hmm, I see. I was trying to work up to solving a large linear system with multiple threads and Krylov.jl, but I couldn’t get any performance improvement, so I tried this minimal example. You’re right about Linux and NUMA: the same code on HPC nodes with Intel’s Sapphire Rapids architecture does make a difference:
NUMA node(s): 8
NUMA node0 CPU(s): 0-12,104-116
NUMA node1 CPU(s): 13-25,117-129
NUMA node2 CPU(s): 26-38,130-142
NUMA node3 CPU(s): 39-51,143-155
NUMA node4 CPU(s): 52-64,156-168
NUMA node5 CPU(s): 65-77,169-181
NUMA node6 CPU(s): 78-90,182-194
NUMA node7 CPU(s): 91-103,195-207
and in Julia 1.12.4:
julia> using MKL
julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so
julia> BLAS.set_num_threads(1)
julia> A = rand(2000, 2000); B = rand(2000, 2000);
julia> A*B; # first time
julia> using BenchmarkTools
julia> @btime *($A, $B); # single threaded
165.446 ms (3 allocations: 30.52 MiB)
julia> BLAS.set_num_threads(4)
julia> @btime *($A, $B); # multi-threaded
44.037 ms (3 allocations: 30.52 MiB)
Interestingly, the M2 shows similar performance, about 48 ms with @btime in my single-threaded experiment. I suppose I should target my linear system solves at the NUMA machines (where they will ultimately be used) to see what works for my use case.
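For reference, the solve I’m working toward looks roughly like this (a sketch only: the shifted 1-D Laplacian is a stand-in for my real operator, and cg assumes a symmetric positive definite matrix):
julia> using LinearAlgebra, SparseArrays, Krylov
julia> n = 10_000;
julia> A = spdiagm(-1 => fill(-1.0, n-1), 0 => fill(2.1, n), 1 => fill(-1.0, n-1));  # SPD test matrix
julia> b = rand(n);
julia> x, stats = cg(A, b);   # conjugate gradient from Krylov.jl
julia> stats.solved
true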