I am doing a dense matrix-vector multiplication on a 64-core workstation. The order of the matrix is 11000. By default, BLAS uses only 8 threads. I have changed both the BLAS and Julia (1.5.1) NUM_THREADS settings and get the best performance with BLAS.set_num_threads(1). It looks like the matrix-vector multiplication is not using the multi-threading option. The piece of my code is:
using LinearAlgebra, BenchmarkTools

a = rand(11000, 11000);
b = rand(11000);
c = similar(b);
BLAS.set_num_threads(n)  # n is varied below
@btime mul!(c, a, b)
I am attaching the times taken for various values of n:
export JULIA_NUM_THREADS=1
BLAS.set_num_threads(1)   89.728 ms
BLAS.set_num_threads(2)  145.720 ms
BLAS.set_num_threads(4)  142.448 ms
BLAS.set_num_threads(8)  141.081 ms
BLAS.set_num_threads(16) 159.111 ms
BLAS.set_num_threads(32) 177.843 ms
BLAS.set_num_threads(64) 178.613 ms

export JULIA_NUM_THREADS=32
BLAS.set_num_threads(1)   92.359 ms
BLAS.set_num_threads(2)  100.988 ms
BLAS.set_num_threads(4)  142.888 ms
BLAS.set_num_threads(8)  141.630 ms
BLAS.set_num_threads(16) 164.816 ms
BLAS.set_num_threads(32) 191.399 ms
BLAS.set_num_threads(64) 201.963 ms

export JULIA_NUM_THREADS=64
BLAS.set_num_threads(1)   93.413 ms
BLAS.set_num_threads(2)  102.900 ms
BLAS.set_num_threads(4)  230.146 ms
BLAS.set_num_threads(8)  146.121 ms
BLAS.set_num_threads(16) 164.659 ms
BLAS.set_num_threads(32) 193.533 ms
BLAS.set_num_threads(64) 193.288 ms
I hope there is some way to make it properly use the multi-threading option.
julia> using LinearAlgebra, BenchmarkTools
julia> a=rand(11000,11000);
julia> b=rand(11000);
julia> c=similar(b);
julia> for n in 1:(Sys.CPU_THREADS ÷ 2)
           BLAS.set_num_threads(n)
           println("Number threads: $n")
           @btime mul!($c, $a, $b)
       end
Number threads: 1
65.269 ms (0 allocations: 0 bytes)
Number threads: 2
34.856 ms (0 allocations: 0 bytes)
Number threads: 3
24.138 ms (0 allocations: 0 bytes)
Number threads: 4
18.899 ms (0 allocations: 0 bytes)
Number threads: 5
16.145 ms (0 allocations: 0 bytes)
Number threads: 6
14.530 ms (0 allocations: 0 bytes)
Number threads: 7
13.619 ms (0 allocations: 0 bytes)
Number threads: 8
13.118 ms (0 allocations: 0 bytes)
Number threads: 9
12.750 ms (0 allocations: 0 bytes)
Number threads: 10
12.634 ms (0 allocations: 0 bytes)
Number threads: 11
12.496 ms (0 allocations: 0 bytes)
Number threads: 12
12.584 ms (0 allocations: 0 bytes)
Number threads: 13
12.567 ms (0 allocations: 0 bytes)
Number threads: 14
12.680 ms (0 allocations: 0 bytes)
Number threads: 15
12.718 ms (0 allocations: 0 bytes)
Number threads: 16
12.701 ms (0 allocations: 0 bytes)
Number threads: 17
12.753 ms (0 allocations: 0 bytes)
Number threads: 18
12.796 ms (0 allocations: 0 bytes)
julia> @btime sum($a)
70.643 ms (0 allocations: 0 bytes)
6.0501937025733836e7
Interesting that single-threaded mul!(c, a, b) is faster than just sum(a).
gemv is dominated by memory bandwidth: we only do O(1) arithmetic operations per element of a, so we're limited by memory speed whenever a doesn't fit in cache.
Single threaded, it got 3.7 GFLOPS; using all cores, it peaked at around 19. These numbers are very bad: this CPU hits roughly 125 GFLOPS/thread in gemm benchmarks.
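As a quick back-of-the-envelope check on where those figures come from (a dense gemv does 2n² floating-point operations; the times are the 65.269 ms single-threaded and 12.496 ms best multithreaded results above):

n = 11_000
flops = 2 * n^2                # one multiply and one add per element of a

flops / 65.269e-3 / 1e9        # ≈ 3.7  GFLOPS, single threaded
flops / 12.496e-3 / 1e9        # ≈ 19.4 GFLOPS, best multithreaded time
8 * n^2 / 65.269e-3 / 1e9      # ≈ 14.8 GB/s: rate at which the 968 MB of `a` streams from RAM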
Using more cores can increase total memory bandwidth, but only up to a point, which is why I see good scaling with the first few extra threads but poor scaling after that.
Memory performance is all that matters here.
I'm guessing your 64-core workstation has some sort of split cache, e.g. a Threadripper/Epyc chip or a multi-socket system, and that this is causing problems by increasing overhead.
But I don't think that necessarily has to be the case, so I'm not really sure. You should be able to split the rows of a 64 ways, and those threads should have no reason to transfer any memory between them.
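Here is a minimal sketch of what I mean (my own illustration, untested on your hardware; blocked_matmul! is just a name for this example): give each Julia thread one contiguous block of rows and let it run a single-threaded BLAS gemv on that block. Each thread writes only its own slice of c and otherwise only reads a and b, so nothing needs to move between threads.

using LinearAlgebra, Base.Threads

# keep BLAS single threaded so Julia threads don't oversubscribe the cores
BLAS.set_num_threads(1)

function blocked_matmul!(c, a, b)
    n = size(a, 1)
    nt = nthreads()
    @threads for t in 1:nt
        rows = (1 + (t - 1) * n ÷ nt):(t * n ÷ nt)  # this thread's block of rows
        # single-threaded gemv on the block; writes touch only c[rows]
        mul!(view(c, rows), view(a, rows, :), b)
    end
    return c
end

Then @btime blocked_matmul!($c, $a, $b) should show whether the poor scaling is threading overhead or just memory bandwidth.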
I have used the @strided option, but it took 1.8993 s, which is much slower than the multiplication using BLAS. I also tried Strided.enable_threaded_mul() but did not find any improvement.
Thank you for your help. In my case, I get good scaling up to 4 threads on another workstation, so maybe there is an issue with the old hardware in the previous workstation.
The lscpu command shows the system information below:
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
Could you elaborate on how the work is split and transferred to the threads, so that the overhead can be minimized? Since the machine is indeed multi-socket (4 sockets × 8 cores × 2 hardware threads = 64 CPUs), I guess that is the point you were making; I may be wrong. Here is my attempt:
julia> using LinearAlgebra,Base.Threads,BenchmarkTools
julia> a=rand(11000,11000);
julia> b=rand(11000);
julia> const c = fill(0.0, 11000);
julia> function nthread_matmul!()
           # each task computes one entry of c; @view avoids copying the row,
           # though rows of a column-major matrix are strided in memory
           Threads.@threads for i in 1:11000
               c[i] = dot(@view(a[i, :]), b)
           end
       end
julia> BLAS.set_num_threads(1);
julia> @btime nthread_matmul!()
Anyway, I need to deal with matrices of order > 50000×50000.
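For scale, a quick calculation of what that implies for storage alone:

m = 50_000
8 * m^2 / 1e9   # = 20.0 GB just to hold one dense Float64 matrix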