Multithreading in LoopVectorization.jl

I would like to give LoopVectorization.jl a try. As far as I understand, to speed up a computation that involves a Threads.@threads for loop, I need to replace the Threads.@threads macro with @tturbo.

Now, I would like to ask a few questions about @tturbo:

  • Does it use the same number of threads as Threads.@threads does? Let me rephrase: if I start Julia with julia -t num_threads, will @tturbo use num_threads threads?
  • Let’s say I have an array of length num_threads and I want each thread to operate on its own element of the array. Can I use Threads.threadid() inside a @tturbo for loop to index into the array?

It will use up to min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores()) threads.
So if you have 8 physical cores and start Julia with 6 threads, it’ll only use up to 6 threads. If you start Julia with 16 threads, it’ll only use up to 8.

In my benchmarks, it’s been more common to see worse performance with more than num_cores threads than to see better performance.
This could change in the future, e.g. with some way to characterize when we’d expect more threads to help.
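
For reference, a minimal snippet spelling out that cap (this is just the min(...) formula from above, nothing more):

using LoopVectorization

# Upper bound on the number of threads @tturbo will use:
# the smaller of the Julia thread count and the physical core count.
thread_cap = min(Threads.nthreads(),
                 Int(LoopVectorization.VectorizationBase.num_cores()))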

But it may also use fewer threads than this.
If you want to find out how many threads it will use for a given function, e.g.

using LoopVectorization

x = rand(400); y = rand(500);
A = rand(length(x), length(y));

function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    @tturbo for n in axes(A,2), m in axes(A,1)
        s += x[m] * A[m,n] * y[n]
    end
    s
end

mydot(x, A, y) ≈ x'*A*y  # sanity check

You can do this via

function mydot_ls(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    # @turbo_debug returns the LoopSet object instead of executing the loop
    LoopVectorization.@turbo_debug for n in axes(A,2), m in axes(A,1)
        s += x[m] * A[m,n] * y[n]
    end
end
ls = mydot_ls(x, A, y);
c = LoopVectorization.choose_order_cost(ls)[end-1];  # estimated cost of the loop nest
loop_lengths = axes(A);
max(1, LoopVectorization.choose_num_threads(
    c / 1024^length(loop_lengths),
    UInt(LoopVectorization.VectorizationBase.num_cores()),
    prod(length, loop_lengths)
) % Int)

Results will vary by architecture, but I get:

julia> max(1, LoopVectorization.choose_num_threads(
           c / 1024^length(loop_lengths),
           UInt(LoopVectorization.VectorizationBase.num_cores()),
           prod(length, loop_lengths)
       ) % Int)
3

The reason to use fewer threads is that for smallish problems like this one, it isn’t profitable to use more (and this problem is memory-bound anyway).
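
As a rough back-of-the-envelope check of the memory-bound claim (my own estimate, assuming Float64 elements and the 400×500 sizes above):

# Roughly 3 flops (two multiplies, one add) per element of A, versus the
# bytes that have to be streamed in for A, x, and y.
flops      = 3 * 400 * 500
bytes_read = sizeof(Float64) * (400*500 + 400 + 500)
flops / bytes_read   # ≈ 0.37 flops per byte, so memory traffic, not compute, is the bottleneck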

julia> @benchmark LinearAlgebra.dot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.125 ΞΌs … 114.231 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     50.552 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   51.323 ΞΌs Β±   2.687 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–„β–ˆβ–‡β–…β–‡β–„                                             ▁▁    β–‚β–‚β–‚ β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–β–„β–ˆβ–ˆβ–†β–‡β–‡β–†β–†β–β–„β–„β–β–ƒβ–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–ƒβ–…β–ˆβ–ˆβ–ˆβ–‡β–ƒβ–β–†β–ˆβ–ˆβ–ˆ β–ˆ
  50.1 ΞΌs       Histogram: log(frequency) by time        62 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mydot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.239 ΞΌs …  20.185 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     5.443 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   5.454 ΞΌs Β± 226.437 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–β–‚β–‚β–ƒβ–ƒβ–…β–‡β–ˆβ–ˆβ–†β–ƒ
  β–‚β–‚β–ƒβ–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚ β–ƒ
  5.24 ΞΌs         Histogram: frequency by time         6.2 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.107 ΞΌs … 66.507 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     19.381 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   31.338 ΞΌs Β± 23.901 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ     ▁                                            β–ƒβ–†β–‡β–„
  β–ˆβ–‡β–†β–‚β–„β–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒ β–ƒ
  2.11 ΞΌs         Histogram: frequency by time          62 ΞΌs <

 Memory estimate: 441 bytes, allocs estimate: 4.

Thank you, @Elrod. I got the part about the number of threads.

But what about using Threads.threadid() inside a @tturbo for loop? Do you know if that is OK?

It probably won’t work as intended. What do you want to do?

LoopVectorization assumes purity and will aggressively hoist functions. So Threads.threadid(), not depending on any of the loops, will get hoisted out of them. Thus, it’d return the id of whatever thread began running the @tturbo code (most commonly, 1). Hiding it inside a function that does depend on a loop somehow, so that it doesn’t get hoisted, would let it actually be evaluated.

But, there’s probably a better way to do whatever you’re after?


Basically, I need to calculate some sum. To avoid race conditions, I introduce an array of length Threads.nthreads() and have each thread update the Threads.threadid()-th element of the array. Then, I aggregate the values of the array.
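
A minimal sketch of the pattern I mean (data here is just a stand-in input vector):

data = rand(10_000)

# One partial sum per thread, indexed by threadid(), aggregated at the end.
partial = zeros(Threads.nthreads())
Threads.@threads for i in eachindex(data)
    partial[Threads.threadid()] += data[i]
end
total = sum(partial)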

So it’s like my mydot example: it is also calculating a sum.
LoopVectorization should handle that correctly.

That is, it uses separate accumulators on each thread, and then aggregates them when it is done.
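
In other words, with @tturbo you can write the reduction directly and skip the threadid()-indexed array entirely. A minimal sketch (mysum is just an illustrative name):

using LoopVectorization

# The reduction variable s is handled for you: each thread gets its own
# accumulator, and the partial sums are combined after the loop.
function mysum(data)
    s = zero(eltype(data))
    @tturbo for i in eachindex(data)
        s += data[i]
    end
    s
end

mysum(data) ≈ sum(data)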


Also look at the Tullio package, which uses LoopVectorization and allows you to code up sums easily and compactly.
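
For example, the three-argument dot product from above could be written like this (a sketch, assuming Tullio.jl is installed alongside LoopVectorization):

using Tullio, LoopVectorization

# Indices that appear only on the right-hand side (m and n) are summed over.
tullio_dot(x, A, y) = @tullio s := x[m] * A[m,n] * y[n]

tullio_dot(x, A, y) ≈ x'*A*y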

By the way, do you know how LoopVectorization.jl partitions the array that the for loop iterates over? Does the array get reshuffled before being split into chunks?

It gets split into tiles along either 1 or 2 loops. There isn’t any reshuffling, and each thread works on a contiguous chunk of consecutive pieces of the original loop.
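
As a rough illustration of what a contiguous chunk means (this is just the idea, not LoopVectorization’s actual scheduling code):

# Conceptual only: split a 1:N iteration space into nthreads consecutive ranges,
# so each thread sweeps one contiguous block in its original order.
function contiguous_chunks(N, nthreads)
    chunk = cld(N, nthreads)
    [(1 + (t - 1) * chunk):min(t * chunk, N) for t in 1:nthreads]
end

contiguous_chunks(10, 3)  # [1:4, 5:8, 9:10]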