I would like to give LoopVectorization.jl a try. As far as I understand, to speed up a computation that involves a `Threads.@threads for` loop, I need to replace the macro `Threads.@threads` with `@tturbo`.

Now, I would like to ask a few questions about `@tturbo`:

• Does it use the same number of threads as `Threads.@threads` does? Let me rephrase: if I start Julia with `julia -t num_threads`, will `@tturbo` use `num_threads` threads?
• Let’s say I have an array of length `num_threads` and I want each thread to operate on its own element of the array. Can I use `Threads.threadid()` inside an `@tturbo for` loop to index into the array?

It will use up to `min(Threads.nthreads(), LoopVectorization.VectorizationBase.num_cores())` threads.
So if you have 8 physical cores and start Julia with 6 threads, it’ll only use up to 6 threads. If you start Julia with 16 threads, it’ll only use up to 8.
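As a minimal plain-Julia sketch of that cap (the helper name `effective_threads` is made up for this illustration; the real cap is computed internally by LoopVectorization):

```julia
# Illustrative sketch of the thread cap described above:
# @tturbo uses at most min(nthreads, physical cores) threads.
effective_threads(nthreads, ncores) = min(nthreads, ncores)

effective_threads(6, 8)   # 8 cores, julia -t 6  → 6
effective_threads(16, 8)  # 8 cores, julia -t 16 → 8
```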

In my benchmarks, it has been more common for me to see worse performance with `>num_cores` threads rather than better.
This could be changed in the future, e.g. with some way to characterize when we’d expect to see better performance.

But it may also use fewer threads than this.
If you want to find out how many threads it will use for a particular loop, take the following function as an example:

```julia
using LoopVectorization

x = rand(400); y = rand(500);
A = rand(length(x), length(y));

function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    @tturbo for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
    s
end

mydot(x, A, y) ≈ x'*A*y
```

You can check this via `@turbo_debug`:

```julia
function mydot_ls(x, A, y)
    s = zero(promote_type(eltype(x), eltype(y), eltype(A)))
    LoopVectorization.@turbo_debug for n in axes(A,2), m in axes(A,1)
        s += x[m]*A[m,n]*y[n]
    end
end

ls = mydot_ls(x, A, y);
c = LoopVectorization.choose_order_cost(ls)[end-1];
loop_lengths = axes(A);
max(1, LoopVectorization.choose_num_threads(
    c / 1024^length(loop_lengths),
    UInt(LoopVectorization.VectorizationBase.num_cores()),
    prod(length, loop_lengths)
) % Int)
```

Results will vary by architecture, but I get:

```julia
julia> max(1, LoopVectorization.choose_num_threads(
           c / 1024^length(loop_lengths),
           UInt(LoopVectorization.VectorizationBase.num_cores()),
           prod(length, loop_lengths)
       ) % Int)
3
```

The reason to use fewer threads is that for smallish problems like this one, it isn’t profitable to use more threads (and this problem is memory-bound anyway).
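For a rough sense of why this is memory-bound, count the bytes the kernel has to stream (sizes taken from the example above):

```julia
# Working set for mydot with length(x) == 400, length(y) == 500:
# A has 400*500 entries, plus x and y themselves, all Float64 (8 bytes).
bytes = (400*500 + 400 + 500) * sizeof(Float64)
bytes / 2^20  # ≈ 1.5 MiB of data streamed for ~200k multiply-adds
```

Each element of `A` is touched exactly once, so there is little arithmetic to hide the memory traffic behind.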

```julia
julia> @benchmark LinearAlgebra.dot($x, $A, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.125 μs … 114.231 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.552 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.323 μs ±   2.687 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▇▄                                             ▁▁    ▂▂▂ ▂
  ███████▅▃▁▄██▆▇▇▆▆▁▄▄▁▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▅███▇▃▁▆███ █
  50.1 μs       Histogram: log(frequency) by time        62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mydot($x, $A, $y)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.239 μs …  20.185 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.443 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.454 μs ± 226.437 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▂▂▃▃▅▇██▆▃
  ▂▂▃▄▅▇████████████▆▄▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▁▂▂▂▁▂▂▂▂ ▃
  5.24 μs         Histogram: frequency by time         6.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   2.107 μs … 66.507 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.381 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.338 μs ± 23.901 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ▁                                            ▃▆▇▄
  █▇▆▂▄███▆▅▄▄▄▃▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆████▇▅▄▃ ▃
  2.11 μs         Histogram: frequency by time          62 μs <

 Memory estimate: 441 bytes, allocs estimate: 4.
```

Thank you, @Elrod. I got the part about the number of threads.

But what about using `Threads.threadid()` inside an `@tturbo for` loop? Do you know if that is OK?

It probably won’t work as intended. What do you want to do?

LoopVectorization assumes purity and will aggressively hoist functions. So `Threads.threadid()`, not depending on any of the loops, will get hoisted out of them. Thus, it’d return the id of whatever thread began running the `@tturbo` code (most commonly, `1`). Hiding it inside a function that does somehow depend on a loop, so that it doesn’t get hoisted, would let it actually be evaluated per iteration.
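The hoisting effect can be illustrated in plain Julia (this is a sketch of the transformation, not LoopVectorization's actual machinery; `f` is a hypothetical function with a hidden side effect so we can count calls):

```julia
# A call whose result does not depend on the loop variable can be
# evaluated once outside the loop under a purity assumption.
const calls = Ref(0)
f() = (calls[] += 1; 7)

function naive(n)       # f() evaluated every iteration
    s = 0
    for _ in 1:n
        s += f()
    end
    s
end

function hoisted(n)     # f() evaluated once, result reused
    fval = f()
    s = 0
    for _ in 1:n
        s += fval
    end
    s
end

calls[] = 0; naive(10);   calls[]  # 10 calls
calls[] = 0; hoisted(10); calls[]  # 1 call
```

Both functions return the same sum, which is exactly why the compiler considers the hoisted form a valid rewrite; but for `threadid()` the "once" answer is not what you wanted.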

But, there’s probably a better way to do whatever you’re after?


Basically, I need to calculate some sum. To avoid race conditions, I introduce an array of length `Threads.nthreads()` and have each thread update the `Threads.threadid()`-th element of the array. Then, I aggregate the values of the array together.
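The pattern being described can be sketched like this with `Threads.@threads` (not `@tturbo`). One caveat worth flagging: on Julia ≥ 1.8 the default `:dynamic` scheduler may migrate tasks between threads mid-loop, so indexing by `threadid()` is only safe with the `:static` scheduler:

```julia
# Per-thread partial sums, aggregated at the end.
# :static pins each iteration range to one thread, making
# threadid()-based indexing safe.
function threaded_sum(x)
    partials = zeros(eltype(x), Threads.nthreads())
    Threads.@threads :static for i in eachindex(x)
        partials[Threads.threadid()] += x[i]
    end
    sum(partials)
end

threaded_sum(ones(1000))  # 1000.0
```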

That's like my `mydot` example above: it is also calculating a sum.
LoopVectorization should handle that correctly.

That is, it uses separate accumulators on each thread, and then aggregates them when it is done.


Also look at the `Tullio` package which uses `LoopVectorization` and allows you to code up sums easily and compactly.

By the way, do you guys know how LoopVectorization.jl partitions the array that the for loop iterates over? Does the array get reshuffled before getting split into chunks?

It gets split into tiles along either 1 or 2 loops. There isn’t any reshuffling, and each thread works on a contiguous chunk of consecutive pieces of the original loop.
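A plain-Julia sketch of contiguous chunking (the function name is made up, and LoopVectorization's actual tiling across 1 or 2 loops is more sophisticated, but the no-reshuffling idea is the same):

```julia
# Split an index range into nchunks contiguous, nearly equal pieces;
# each "thread" would get one piece of consecutive indices.
function contiguous_chunks(range, nchunks)
    n = length(range)
    [range[(1 + div((c - 1) * n, nchunks)):div(c * n, nchunks)] for c in 1:nchunks]
end

contiguous_chunks(1:10, 3)  # [1:3, 4:6, 7:10]
```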