Multithreading in LoopVectorization.jl

It will use up to min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores()) threads.
So if you have 8 physical cores and start Julia with 6 threads, it'll use at most 6 threads. If you start Julia with 16 threads, it'll use at most 8.
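A quick way to check that cap on your own machine (a sketch, assuming LoopVectorization is installed; `num_cores` returns a static integer, so `% Int` converts it to a plain Int):

```julia
using LoopVectorization

# Upper bound on the number of threads @tturbo will consider on this machine:
cap = min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores() % Int)
```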

In my benchmarks, using more than num_cores threads has more often hurt performance than helped.
This could change in the future, e.g. if we find some way to characterize when more threads would actually help.

But it may also use fewer threads than this cap.
Say you want to find out how many threads it will use for a function like

x = rand(400); y = rand(500);
A = rand(length(x),length(y));
function mydot(x,A,y)
    s = zero(promote_type(eltype(x),eltype(y),eltype(A)))
    @tturbo for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
    s
end
mydot(x,A,y) ≈ x'*A*y

You can do this via

function mydot_ls(x,A,y)
    s = zero(promote_type(eltype(x),eltype(y),eltype(A)))
    # @turbo_debug returns the LoopSet instead of executing the loop
    LoopVectorization.@turbo_debug for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
end
ls = mydot_ls(x,A,y);
c = LoopVectorization.choose_order_cost(ls)[end-1];
loop_lengths = axes(A);
max(1, LoopVectorization.choose_num_threads(
    c / 1024^length(loop_lengths),
    UInt(LoopVectorization.VectorizationBase.num_cores()),
    prod(length, loop_lengths)
) % Int)

Results will vary by architecture, but I get:

julia> max(1, LoopVectorization.choose_num_threads(
           c / 1024^length(loop_lengths),
           UInt(LoopVectorization.VectorizationBase.num_cores()),
           prod(length, loop_lengths)
       ) % Int)
3

The reason to use fewer threads is that for smallish problems like this one, it isn't profitable to use more (and this problem is memory-bound anyway).
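The memory-bound remark can be backed up with a back-of-envelope estimate (my numbers, not anything LoopVectorization itself computes):

```julia
# Rough arithmetic-intensity estimate for the 400×500 kernel above.
m, n = 400, 500
flops = 3 * m * n               # two multiplies and one add per element
bytes = 8 * (m * n + m + n)     # Float64 traffic for A, x, and y (one pass each)
intensity = flops / bytes       # ≈ 0.37 flops per byte
```

At well under one flop per byte of data touched, the loop is limited by memory bandwidth rather than arithmetic on most machines, so threads beyond the few needed to saturate bandwidth buy little.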

julia> @benchmark LinearAlgebra.dot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.125 μs … 114.231 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.552 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.323 μs ±   2.687 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▇▄                                             ▁▁    ▂▂▂ ▂
  ███████▅▃▁▄██▆▇▇▆▆▁▄▄▁▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▅███▇▃▁▆███ █
  50.1 μs       Histogram: log(frequency) by time        62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mydot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.239 μs …  20.185 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.443 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.454 μs ± 226.437 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▂▂▃▃▅▇██▆▃
  ▂▂▃▄▅▇████████████▆▄▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▁▂▂▂▁▂▂▂▂ ▃
  5.24 μs         Histogram: frequency by time         6.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.107 μs … 66.507 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.381 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.338 μs ± 23.901 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ▁                                            ▃▆▇▄
  █▇▆▂▄███▆▅▄▄▄▃▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆████▇▅▄▃ ▃
  2.11 μs         Histogram: frequency by time          62 μs <

 Memory estimate: 441 bytes, allocs estimate: 4.