Multithreading in LoopVectorization.jl

It will use up to min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores()) threads.
So if you have 8 physical cores and start Julia with 6 threads, it'll use at most 6 threads. If you start Julia with 16 threads, it'll use at most 8.
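A quick way to check that cap on your own machine (a sketch, assuming LoopVectorization is installed; `num_cores` returns a static integer, so `% Int` converts it to a plain Int):

```julia
using LoopVectorization

# Upper bound on the number of threads @tturbo will consider on this machine:
cap = min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores() % Int)
```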

In my benchmarks, using more than num_cores threads has more often hurt performance than helped.
This could change in the future, e.g. if we find some way to characterize when more threads would actually help.

But it may also use fewer threads than this cap.
Say you want to find out how many threads it will use for a function like

x = rand(400); y = rand(500);
A = rand(length(x),length(y));
function mydot(x,A,y)
    s = zero(promote_type(eltype(x),eltype(y),eltype(A)))
    @tturbo for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
    s
end
mydot(x,A,y) ≈ x'*A*y

You can do this via

function mydot_ls(x,A,y)
    s = zero(promote_type(eltype(x),eltype(y),eltype(A)))
    # @turbo_debug returns the LoopSet instead of executing the loop
    LoopVectorization.@turbo_debug for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
end
ls = mydot_ls(x,A,y);
c = LoopVectorization.choose_order_cost(ls)[end-1];
loop_lengths = axes(A);
max(1, LoopVectorization.choose_num_threads(
    c / 1024^length(loop_lengths),
    UInt(LoopVectorization.VectorizationBase.num_cores()),
    prod(length, loop_lengths)
) % Int)

Results will vary by architecture, but I get:

julia> max(1, LoopVectorization.choose_num_threads(
           c / 1024^length(loop_lengths),
           UInt(LoopVectorization.VectorizationBase.num_cores()),
           prod(length, loop_lengths)
       ) % Int)
3

The reason to use fewer threads is that for smallish problems like this one, it isn't profitable to use more (and this problem is memory-bound anyway).
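The memory-bound remark can be backed up with a back-of-envelope estimate (my numbers, not anything LoopVectorization itself computes):

```julia
# Rough arithmetic-intensity estimate for the 400×500 kernel above.
m, n = 400, 500
flops = 3 * m * n               # two multiplies and one add per element
bytes = 8 * (m * n + m + n)     # Float64 traffic for A, x, and y (one pass each)
intensity = flops / bytes       # ≈ 0.37 flops per byte
```

At well under one flop per byte of data touched, the loop is limited by memory bandwidth rather than arithmetic on most machines, so threads beyond the few needed to saturate bandwidth buy little.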

julia> @benchmark LinearAlgebra.dot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.125 μs … 114.231 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.552 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.323 μs ±   2.687 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▇▄                                             ▁▁    ▂▂▂ ▂
  ███████▅▃▁▄██▆▇▇▆▆▁▄▄▁▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▅███▇▃▁▆███ █
  50.1 μs       Histogram: log(frequency) by time        62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mydot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.239 μs …  20.185 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.443 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.454 μs ± 226.437 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▂▂▃▃▅▇██▆▃
  ▂▂▃▄▅▇████████████▆▄▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▁▂▂▂▁▂▂▂▂ ▃
  5.24 μs         Histogram: frequency by time         6.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.107 μs … 66.507 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.381 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.338 μs ± 23.901 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ▁                                            ▃▆▇▄
  █▇▆▂▄███▆▅▄▄▄▃▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆████▇▅▄▃ ▃
  2.11 μs         Histogram: frequency by time          62 μs <

 Memory estimate: 441 bytes, allocs estimate: 4.