Question about DGEMV called via Julia’s BLAS link:
I’m teaching myself some performance programming practices in C, and I wanted to use Julia’s BLAS as a convenient baseline to benchmark my work against. I’m developing on macOS (M1 Pro) and x86_64 (Ubuntu).
On macOS, I run the following script and get substantially different median execution times, telling me that DGEMV is indeed multithreaded (and reasonably efficient!) in Julia’s default BLAS setup:
using BenchmarkTools, LinearAlgebra
BLAS.set_num_threads(2) # set between 1 and 6 for this machine (M1 Pro)
n = 10000
A = rand(n, n)
x = rand(n)
b = rand(n)
res = @benchmark mul!($b, $A, $x)
display(res) # About 35ms for single-threaded and 19ms for 2 threads
However, when I run the same code on my x86_64 computer (9950X with 16 cores), I get basically the same median runtime for any number of BLAS threads.
Is there a bug here?
Just googling around and asking some friendly LLMs suggests that many implementations of DGEMV don’t bother to multithread it, but from my testing on the M1 Pro, threading clearly helps.
BenchmarkTools.Trial: 597 samples with 1 evaluation per sample.
Range (min … max): 6.517 ms … 10.181 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.385 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.378 ms ± 459.505 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▂▃▂ ▆▅▄█▆▃▃▅▄▁▆▇█▄▂▅▁▁ ▁
▃▁▂▁▁▁▁▁▁▁▁▂▁▁▁▂▁▃▃▃▄▅▃▆▆█████▇▇██████████████████▆█▅▄▃▂▁▃▃ ▄
6.52 ms Histogram: frequency by time 9.41 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Fresh session, with no thread count set and one warmup run:
BenchmarkTools.Trial: 771 samples with 1 evaluation per sample.
Range (min … max): 6.125 ms … 7.346 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.457 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.482 ms ± 158.762 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇█▅ ▁
▁▂▃▂▂▅▄▃▄▂▂▁▂▂▂▁▂▄▅▃▆████▅█▅▅▇▄▅█▅█▇▆▇▅▃▄▃▃▄▃▃▂▄▅▄▃▃▄▃▅▄▄▃▂ ▃
6.12 ms Histogram: frequency by time 6.81 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
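For reference, here’s the kind of sanity check I run first to confirm which BLAS backend Julia has actually loaded and how many threads it reports (a minimal sketch; the exact config output depends on whether the default OpenBLAS or something like MKL.jl is in use):
using LinearAlgebra
BLAS.get_config()       # lists the loaded BLAS/LAPACK libraries (OpenBLAS by default)
BLAS.get_num_threads()  # the thread count the BLAS runtime actually reports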
My uninformed guess is that the overhead of BLAS threading outweighs the benefit for small matrices such as this one.
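If that’s the cause, one way to probe it (a sketch; the sizes and thread counts here are assumptions, and the crossover point will vary by machine) is to sweep the matrix size and compare one BLAS thread against several:
using BenchmarkTools, LinearAlgebra

for n in (1_000, 4_000, 10_000, 16_000)
    A = rand(n, n); x = rand(n); b = similar(x)
    BLAS.set_num_threads(1)
    t1 = @belapsed mul!($b, $A, $x)   # minimum time with a single BLAS thread
    BLAS.set_num_threads(8)
    t8 = @belapsed mul!($b, $A, $x)   # minimum time with 8 BLAS threads
    println("n = $n: 1 thread = $(round(1e3 * t1, digits=2)) ms, 8 threads = $(round(1e3 * t8, digits=2)) ms")
end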
dgemv is memory bound and has minimal re-use opportunity. It will only benefit from multithreading to the extent that a single core is not able to make use of the full memory bandwidth available.
A single core often won’t be able to, due to various queues in the out of order engine filling up, and it being unable to issue more load instructions while still waiting on the already-issued loads to arrive.
You can use a system monitor or LinuxPerf to observe that multiple cores are in fact being used; it’s just that throwing more cores at the problem mostly means there are more cores waiting on memory to arrive.
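A back-of-the-envelope check (a sketch; substitute your own measured time and compare against your platform’s memory bandwidth): for n = 10000 the matrix alone is about 0.8 GB, and dgemv has to stream all of it once, so the median runtime directly implies an effective bandwidth:
n = 10_000
t = 6.457e-3                   # median time in seconds from the x86_64 benchmark above
bytes = n^2 * sizeof(Float64)  # the matrix dominates; the vectors are negligible
gbps = bytes / t / 1e9
println("effective bandwidth ≈ $(round(gbps, digits=1)) GB/s")
# Once this figure approaches the memory system's limit, adding BLAS threads
# cannot make dgemv faster; the extra cores just wait on the same memory.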