Multithreading in LoopVectorization.jl

I would like to give LoopVectorization.jl a try. As far as I understand, to speed up a computation that involves a Threads.@threads for loop, I need to replace the Threads.@threads macro with @tturbo.

Now, I would like to ask a few questions about @tturbo:

  • Does it use the same number of threads as Threads.@threads does? Let me rephrase: if I start Julia with julia -t num_threads, will @tturbo use num_threads threads?
  • Let’s say I have an array of length num_threads and I want each thread to operate on its own element of the array. Can I use Threads.threadid() inside a @tturbo for loop to index into the array?

It will use up to min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores()) threads.
So if you have 8 physical cores and start Julia with 6 threads, it’ll only use up to 6 threads. If you start Julia with 16 threads, it’ll only use up to 8.

In my benchmarks, it’s been more common to see worse performance with more than num_cores threads than to see better performance.
This could change in the future, e.g. with some way to characterize when we’d expect more threads to help.
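
For reference, a minimal snippet spelling out that cap (this is just the min(...) formula from above, nothing more):

using LoopVectorization

# Upper bound on the number of threads @tturbo will use:
# the smaller of the Julia thread count and the physical core count.
thread_cap = min(Threads.nthreads(),
                 Int(LoopVectorization.VectorizationBase.num_cores()))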

But it may also use fewer threads than this.
If you want to find out how many threads it will use for a given function, e.g.

using LoopVectorization

x = rand(400); y = rand(500);
A = rand(length(x), length(y));

function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    @tturbo for n in axes(A,2), m in axes(A,1)
        s += x[m] * A[m,n] * y[n]
    end
    s
end

mydot(x, A, y) ≈ x'*A*y  # sanity check

You can do this via

function mydot_ls(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    # @turbo_debug returns the LoopSet object instead of executing the loop
    LoopVectorization.@turbo_debug for n in axes(A,2), m in axes(A,1)
        s += x[m] * A[m,n] * y[n]
    end
end
ls = mydot_ls(x, A, y);
c = LoopVectorization.choose_order_cost(ls)[end-1];  # estimated cost of the loop nest
loop_lengths = axes(A);
max(1, LoopVectorization.choose_num_threads(
    c / 1024^length(loop_lengths),
    UInt(LoopVectorization.VectorizationBase.num_cores()),
    prod(length, loop_lengths)
) % Int)

Results will vary by architecture, but I get:

julia> max(1, LoopVectorization.choose_num_threads(
           c / 1024^length(loop_lengths),
           UInt(LoopVectorization.VectorizationBase.num_cores()),
           prod(length, loop_lengths)
       ) % Int)
3

The reason to use fewer threads is that for smallish problems like this one, it isn’t profitable to use more (and this problem is memory-bound anyway).
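
As a rough back-of-the-envelope check of the memory-bound claim (my own estimate, assuming Float64 elements and the 400×500 sizes above):

# Roughly 3 flops (two multiplies, one add) per element of A, versus the
# bytes that have to be streamed in for A, x, and y.
flops      = 3 * 400 * 500
bytes_read = sizeof(Float64) * (400*500 + 400 + 500)
flops / bytes_read   # ≈ 0.37 flops per byte, so memory traffic, not compute, is the bottleneck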

julia> @benchmark LinearAlgebra.dot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.125 ΞΌs … 114.231 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     50.552 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   51.323 ΞΌs Β±   2.687 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–„β–ˆβ–‡β–…β–‡β–„                                             ▁▁    β–‚β–‚β–‚ β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–β–„β–ˆβ–ˆβ–†β–‡β–‡β–†β–†β–β–„β–„β–β–ƒβ–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–ƒβ–…β–ˆβ–ˆβ–ˆβ–‡β–ƒβ–β–†β–ˆβ–ˆβ–ˆ β–ˆ
  50.1 ΞΌs       Histogram: log(frequency) by time        62 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mydot($x,$A,$y)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.239 ΞΌs …  20.185 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     5.443 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   5.454 ΞΌs Β± 226.437 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–β–‚β–‚β–ƒβ–ƒβ–…β–‡β–ˆβ–ˆβ–†β–ƒ
  β–‚β–‚β–ƒβ–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚ β–ƒ
  5.24 ΞΌs         Histogram: frequency by time         6.2 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.107 ΞΌs … 66.507 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     19.381 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   31.338 ΞΌs Β± 23.901 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ     ▁                                            β–ƒβ–†β–‡β–„
  β–ˆβ–‡β–†β–‚β–„β–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒ β–ƒ
  2.11 ΞΌs         Histogram: frequency by time          62 ΞΌs <

 Memory estimate: 441 bytes, allocs estimate: 4.

Thank you, @Elrod. I got the part about the number of threads.

But what about using Threads.threadid() inside a @tturbo for loop? Do you know if that is OK?

It probably won’t work as intended. What do you want to do?

LoopVectorization assumes purity and will aggressively hoist functions. So Threads.threadid(), not depending on any of the loops, will get hoisted out of them. Thus, it’d return the id of whatever thread began running the @tturbo code (most commonly, 1). Hiding it inside a function that does depend on a loop somehow, so that it doesn’t get hoisted, would let it actually be evaluated.

But, there’s probably a better way to do whatever you’re after?


Basically, I need to calculate some sum. To avoid race conditions, I introduce an array of length Threads.nthreads() and have each thread update the Threads.threadid()-th element of the array. Then, I aggregate the values of the array.
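
A minimal sketch of the pattern I mean (data here is just a stand-in input vector):

data = rand(10_000)

# One partial sum per thread, indexed by threadid(), aggregated at the end.
partial = zeros(Threads.nthreads())
Threads.@threads for i in eachindex(data)
    partial[Threads.threadid()] += data[i]
end
total = sum(partial)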

So it’s like my mydot example: it is also calculating a sum.
LoopVectorization should handle that correctly.

That is, it uses separate accumulators on each thread, and then aggregates them when it is done.
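
In other words, with @tturbo you can write the reduction directly and skip the threadid()-indexed array entirely. A minimal sketch (mysum is just an illustrative name):

using LoopVectorization

# The reduction variable s is handled for you: each thread gets its own
# accumulator, and the partial sums are combined after the loop.
function mysum(data)
    s = zero(eltype(data))
    @tturbo for i in eachindex(data)
        s += data[i]
    end
    s
end

mysum(data) ≈ sum(data)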


Also look at the Tullio package, which uses LoopVectorization and allows you to code up sums easily and compactly.
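
For example, the three-argument dot product from above could be written like this (a sketch, assuming Tullio.jl is installed alongside LoopVectorization):

using Tullio, LoopVectorization

# Indices that appear only on the right-hand side (m and n) are summed over.
tullio_dot(x, A, y) = @tullio s := x[m] * A[m,n] * y[n]

tullio_dot(x, A, y) ≈ x'*A*y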

By the way, do you know how LoopVectorization.jl partitions the array that the for loop iterates over? Does the array get reshuffled before being split into chunks?

It gets split into tiles along either 1 or 2 loops. There isn’t any reshuffling, and each thread works on a contiguous chunk of consecutive pieces of the original loop.
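
As a rough illustration of what a contiguous chunk means (this is just the idea, not LoopVectorization’s actual scheduling code):

# Conceptual only: split a 1:N iteration space into nthreads consecutive ranges,
# so each thread sweeps one contiguous block in its original order.
function contiguous_chunks(N, nthreads)
    chunk = cld(N, nthreads)
    [(1 + (t - 1) * chunk):min(t * chunk, N) for t in 1:nthreads]
end

contiguous_chunks(10, 3)  # [1:4, 5:8, 9:10]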