It will use up to min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores()) threads.
So if you have 8 physical cores and start Julia with 6 threads, it'll only use up to 6 threads. If you start Julia with 16 threads, it'll only use up to 8.
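To check that cap on your own machine, a minimal sketch (assuming LoopVectorization is installed, which pulls in VectorizationBase):

using LoopVectorization
# upper bound on the number of threads @tturbo will consider
min(Threads.nthreads(), Int(LoopVectorization.VectorizationBase.num_cores()))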
In my benchmarks, it's been more common for me to see worse performance with more than num_cores threads, not better.
This could be changed in the future, e.g. with some way to characterize when we'd expect to see better performance.
But it may also use fewer threads than this.
If you want to find out how many threads it will use in a function like
using LoopVectorization

x = rand(400); y = rand(500);
A = rand(length(x), length(y));

# threaded evaluation of the trilinear form x' * A * y
function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    @tturbo for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
    s
end
mydot(x,A,y) ≈ x'*A*y
You can do this via
function mydot_ls(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    # @turbo_debug returns the LoopSet instead of executing the loop
    LoopVectorization.@turbo_debug for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
end
ls = mydot_ls(x, A, y);
c = LoopVectorization.choose_order_cost(ls)[end-1];
loop_lengths = axes(A);
max(1, LoopVectorization.choose_num_threads(
        c / 1024^length(loop_lengths),
        UInt(LoopVectorization.VectorizationBase.num_cores()),
        prod(length, loop_lengths)
    ) % Int)
Results will vary by architecture, but I get:
julia> max(1, LoopVectorization.choose_num_threads(
               c / 1024^length(loop_lengths),
               UInt(LoopVectorization.VectorizationBase.num_cores()),
               prod(length, loop_lengths)
           ) % Int)
3
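If you want to reuse that query, a sketch of a small helper wrapping the same calls might look like this (the name estimate_tturbo_threads is made up here, and choose_order_cost / choose_num_threads are LoopVectorization internals that could change between versions):

function estimate_tturbo_threads(ls, loop_lengths)
    # cost estimate for the loop order @tturbo would pick, as above
    c = LoopVectorization.choose_order_cost(ls)[end-1]
    max(1, LoopVectorization.choose_num_threads(
            c / 1024^length(loop_lengths),
            UInt(LoopVectorization.VectorizationBase.num_cores()),
            prod(length, loop_lengths)
        ) % Int)
end

estimate_tturbo_threads(mydot_ls(x, A, y), axes(A))  # 3 in the run above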
The reason to use fewer threads is that for smallish problems like this one, it isn't profitable to use more threads (and this problem is memory-bound anyway).
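For reference, the benchmarks below assume a setup roughly like this on top of the definitions above (a sketch, not the exact original session):

using BenchmarkTools, LinearAlgebra  # provides @benchmark and the three-argument dot used below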
julia> @benchmark LinearAlgebra.dot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.125 μs … 114.231 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.552 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.323 μs ±   2.687 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  50.1 μs          Histogram: log(frequency) by time           62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mydot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.239 μs … 20.185 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.443 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.454 μs ± 226.437 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  5.24 μs          Histogram: frequency by time          6.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  2.107 μs … 66.507 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.381 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.338 μs ± 23.901 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  2.11 μs          Histogram: frequency by time          62 μs <

 Memory estimate: 441 bytes, allocs estimate: 4.