Question about DGEMV called via Julia’s BLAS link:
I’m teaching myself some performance programming practices in C, and I wanted to use Julia’s BLAS as a convenient baseline to benchmark my work against. I’m developing on macOS (M1 Pro) and x86_64 (Ubuntu).
On macOS, I run the following script and get substantially different median execution times, telling me that DGEMV is indeed multithreaded (and reasonably efficient!) in Julia’s default BLAS setup:
using BenchmarkTools, LinearAlgebra
BLAS.set_num_threads(2) # set between 1 and 6 for this machine (M1 Pro)
n = 10000
A = rand(n, n)
x = rand(n)
b = rand(n)
res = @benchmark mul!($b, $A, $x)
display(res) # About 35ms for single-threaded and 19ms for 2 threads
However, when I run the same code on my x86_64 computer (9950X with 16 cores), I get basically the same median runtime for any number of BLAS threads.
Is there a bug here?
Just googling around and asking some friendly LLMs suggests that many implementations of DGEMV don’t bother to multithread it, but from my testing on the M1 Pro, threading clearly helps.
BenchmarkTools.Trial: 597 samples with 1 evaluation per sample.
Range (min … max): 6.517 ms … 10.181 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.385 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.378 ms ± 459.505 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▂▃▂ ▆▅▄█▆▃▃▅▄▁▆▇█▄▂▅▁▁ ▁
▃▁▂▁▁▁▁▁▁▁▁▂▁▁▁▂▁▃▃▃▄▅▃▆▆█████▇▇██████████████████▆█▅▄▃▂▁▃▃ ▄
6.52 ms Histogram: frequency by time 9.41 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Fresh session, with no thread count set and one warmup run:
BenchmarkTools.Trial: 771 samples with 1 evaluation per sample.
Range (min … max): 6.125 ms … 7.346 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.457 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.482 ms ± 158.762 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇█▅ ▁
▁▂▃▂▂▅▄▃▄▂▂▁▂▂▂▁▂▄▅▃▆████▅█▅▅▇▄▅█▅█▇▆▇▅▃▄▃▃▄▃▃▂▄▅▄▃▃▄▃▅▄▄▃▂ ▃
6.12 ms Histogram: frequency by time 6.81 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
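For reference, here’s the kind of sanity check I run first to confirm which BLAS backend Julia has actually loaded and how many threads it reports (a minimal sketch; the exact config output depends on whether the default OpenBLAS or something like MKL.jl is in use):
using LinearAlgebra
BLAS.get_config()       # lists the loaded BLAS/LAPACK libraries (OpenBLAS by default)
BLAS.get_num_threads()  # the thread count the BLAS runtime actually reports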
My uninformed guess is that the overhead of BLAS threading outweighs the benefit for small matrices such as this one.
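If that’s the cause, one way to probe it (a sketch; the sizes and thread counts here are assumptions, and the crossover point will vary by machine) is to sweep the matrix size and compare one BLAS thread against several:
using BenchmarkTools, LinearAlgebra

for n in (1_000, 4_000, 10_000, 16_000)
    A = rand(n, n); x = rand(n); b = similar(x)
    BLAS.set_num_threads(1)
    t1 = @belapsed mul!($b, $A, $x)   # minimum time with a single BLAS thread
    BLAS.set_num_threads(8)
    t8 = @belapsed mul!($b, $A, $x)   # minimum time with 8 BLAS threads
    println("n = $n: 1 thread = $(round(1e3 * t1, digits=2)) ms, 8 threads = $(round(1e3 * t8, digits=2)) ms")
end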
dgemv is memory bound and has minimal re-use opportunity. It will only benefit from multithreading to the extent that a single core is not able to make use of the full memory bandwidth available.
A single core often won’t be able to, due to various queues in the out of order engine filling up, and it being unable to issue more load instructions while still waiting on the already-issued loads to arrive.
You can use a system monitor or LinuxPerf to observe that multiple cores are in fact being used; it’s just that throwing more cores at the problem mostly means there are more cores waiting on memory to arrive.
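A back-of-the-envelope check (a sketch; substitute your own measured time and compare against your platform’s memory bandwidth): for n = 10000 the matrix alone is about 0.8 GB, and dgemv has to stream all of it once, so the median runtime directly implies an effective bandwidth:
n = 10_000
t = 6.457e-3                   # median time in seconds from the x86_64 benchmark above
bytes = n^2 * sizeof(Float64)  # the matrix dominates; the vectors are negligible
gbps = bytes / t / 1e9
println("effective bandwidth ≈ $(round(gbps, digits=1)) GB/s")
# Once this figure approaches the memory system's limit, adding BLAS threads
# cannot make dgemv faster; the extra cores just wait on the same memory.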