# How to use LoopVectorization.jl?

I have a question regarding LoopVectorization.jl. I’ve been trying to integrate it into my project but am stuck with some errors. Here is a MWE that reproduces it:

``````
using LoopVectorization

function F!(xllp::Vector{Vector{Int}}, xllw::Vector{Vector{Float64}},
            yllp::Vector{Vector{Int}}, yllw::Vector{Vector{Float64}},
            Zp::Array{Int,2}, Zw::Array{Float64,2})::Nothing
    m, n = size(Zw,1), length(xllp)
    @tturbo for k = 1:n
        t = Threads.threadid()                # scratch column for this thread
        for i = 1:m
            Zw[i,t] = 0                       # clear from previous computation
        end
        for j = 1:length(xllp[k])
            v, w = xllp[k][j], xllw[k][j]
            Zw[v,t] += w                      # sum weights with same position
        end
        i = 1
        for ii = 1:m                          # recover sorted unique positions
            Zp[i,t] = ii
            Zw[i,t] = Zw[ii,t]
            i += (Zw[ii,t] == 0 ? 0 : 1)
        end
        i -= 1
        yllp[k], yllw[k] = Zp[1:i,t], Zw[1:i,t]   # store result permanently
    end
    return nothing
end

m, n, s, nt = 100, 1000, 10, Threads.nthreads()   # toy sizes
xllp = [rand(1:m, rand(0:s)) for j = 1:n]
xllw = [rand(Float64, length(l)) for l in xllp]
yllp, yllw = Vector{Vector{Int}}(undef,n), Vector{Vector{Float64}}(undef,n)
Zp, Zw = zeros(Int,m,nt), zeros(Float64,m,nt)     # temporary arrays for calculations
F!(xllp, xllw, yllp, yllw, Zp, Zw)                # run F! on toy data
``````

If you run this, you’ll get:

``````ERROR: LoadError: UndefVarError: `k` not defined
Stacktrace:
[1] F!(xllp::Vector{Vector{Int64}}, xllw::Vector{Vector{Float64}}, yllp::Vector{Vector{Int64}}, yllw::Vector{Vector{Float64}}, Zp::Matrix{Int64}, Zw::Matrix{Float64})
@ Main E:\Dropbox\programming_Julia\Q06\Q06.jl:6
[2] top-level scope
@ E:\Dropbox\programming_Julia\Q06\Q06.jl:26
[3] include(fname::String)
@ Base.MainInclude .\client.jl:478
[4] top-level scope
@ REPL[10]:1
in expression starting at E:\Dropbox\programming_Julia\Q06\Q06.jl:26
``````

However, running the same code without `@tturbo` works as desired. Any idea on what the issue is? Is there a way to reorganize my code to be accepted by `@tturbo` ?

I can’t tell you how to fix it, but you might be hitting a (current) limitation of LoopVectorization. From the documentation:

> The macro assumes that loop iterations can be reordered. It also currently supports simple nested loops, where loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of the loop nest. These limitations should be removed in a future version.
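To make the limitation concrete: in `F!` the inner bound `length(xllp[k])` varies with `k`, which is exactly what the quoted passage rules out. A rectangular nest like the following sketch (plain Julia; with LoopVectorization loaded, `@tturbo` could prefix the outer loop) has the supported shape:

``````julia
# Rectangular nest: the inner bound `m` is constant across outer
# iterations, and there is a single loop at each nest level.
function colsums!(out::Vector{Float64}, A::Matrix{Float64})
    m, n = size(A)
    for k in 1:n          # outer loop over columns
        s = 0.0
        for i in 1:m      # inner bound does not depend on k
            s += A[i, k]
        end
        out[k] = s
    end
    return out
end
``````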


Aha, so it’s best to wait until LoopVectorization supports inner loops of different lengths. Ok.

Regarding `threadid()`, thank you for pointing out the danger of my pattern. However, I’m confused by your link. As I understand it, my code above can potentially give incorrect results? Then how could I achieve what this function does with `Threads.@threads` instead of `@tturbo`?

The blog post suggests using `Task`s instead of `@threads`. So when is using the latter even safe, and why does `@threads` exist at all? I’m not sure how to use `@spawn` in my case, as it’s a new macro for me.

You could adopt the `partition` example from the linked PSA post and write something like

``````
function F!(xllp::Vector{Vector{Int}}, xllw::Vector{Vector{Float64}},
            yllp::Vector{Vector{Int}}, yllw::Vector{Vector{Float64}},
            Zp::Array{Int,2}, Zw::Array{Float64,2})::Nothing
    m, n, nt = size(Zw,1), length(xllp), size(Zw,2)
    @sync for (t, ks) in enumerate(Iterators.partition(1:n, cld(n, nt)))
        Threads.@spawn for k in ks
            # t is now fixed for this task: use only columns Zp[:,t], Zw[:,t]
        end
    end
    return nothing
end
``````

However, I am not even sure that your MWE would give correct results when used with a static scheduler.
Can you explain how the `@tturbo` loop is supposed to work with the `k` index and the `t` thread id?

In general you can’t control which `k` gets assigned to which `t`, so your loop would save the results to `yllp, yllw` in random ordering when the function is run multiple times (assuming static scheduling, e.g. one `k` for one `t`). [And if `nthreads() > n` you encounter a data race.]
*Clarified in the next comment by the OP; I had just confused `t` and `k` in the last line.*

So perhaps you need to rethink your algorithm again before trying to continue parallelization.

And to work around the problem with `@turbo` and nested loops: you can remove the `@tturbo` macro from the start and add a `@turbo` (note: only one `t`) in front of each inner `for` loop.
But since there is no crazy math going on here, I suspect that prefixing the whole loop with `@inbounds` might have the same effect.
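As a minimal sketch of that per-loop annotation, here is one of `F!`’s inner loops pulled out on its own, using only `@inbounds` from Base so it runs without LoopVectorization (`@turbo` would go in the same position):

``````julia
# Clear one scratch column of Zw; `t` selects the column, as in the MWE.
# The bounds-check elision via @inbounds is safe because i ranges over
# exactly the first axis of Zw.
function clear_column!(Zw::Matrix{Float64}, t::Int)
    @inbounds for i in 1:size(Zw, 1)
        Zw[i, t] = 0.0
    end
    return Zw
end
``````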

`Task`s are not safer than using `@threads` directly. In any case, you (the programmer) are responsible for avoiding race conditions, deadlocks, etc.

@fatteneder My function `F!` does the following: each `xllp[k]` contains positions (unsorted and possibly with duplicates), and each `xllw[k]` contains weights. The function sorts the positions and the corresponding weights. Whenever two positions are equal, the corresponding weights are summed. The result (sorted unique positions and summed weights) is written into `yllp[k]` and `yllw[k]`.

This is achieved by writing positions and weights into the arrays `Zp` and `Zw`, which are temporary; their content is irrelevant once the function call finishes. Hence it is irrelevant which `k` is paired with which `t`, as long as `t` does not change while a given `k` is being processed.

Since the same procedure is performed for every `k`, I wish to parallelize it to the best of my abilities.
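For reference, here is a serial sketch of what `F!` computes for a single `k`, based on the description above (the names `merge_weights`, `positions`, and `weights` are illustrative, not from the original code):

``````julia
# Accumulate weights by position, then keep the sorted unique positions
# whose summed weight is nonzero -- the result F! would store in
# yllp[k], yllw[k]. Positions are assumed to lie in 1:m.
function merge_weights(positions::Vector{Int}, weights::Vector{Float64}, m::Int)
    acc = zeros(m)
    for (p, w) in zip(positions, weights)
        acc[p] += w
    end
    keep = findall(!=(0.0), acc)   # indices are sorted and unique
    return keep, acc[keep]
end
``````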

> In general you can’t control which `k` gets assigned to which `t`, so your loop would save the results to `yllp, yllw` in random ordering when the function is run multiple times (assuming static scheduling, e.g. one `k` for one `t`). [And if `nthreads() > n` you encounter a data race.]

Why would the results be stored into `yllp, yllw` in random order?!? My understanding of the algorithm is as follows (please stop me where I am wrong):
Each of the CPU’s threads picks a specific `k` and works on it. A thread `t` works on the `k`-th array and does the computation on the preallocated space `Zp[:,t], Zw[:,t]`: at the start it sets all entries there to 0, then does the aggregation, then reads the sorted unique positions from there, and only the `t`-th thread reads and writes there. When it finishes, the result is written into `yllp[k], yllw[k]`.

So e.g. thread 8 might work on `xllp[3], xllw[3]`, compute on `Zp[:,8], Zw[:,8]`, then write the result into `yllp[3], yllw[3]`. I thought that during each iteration of `@threads for ...`, our `t` does not change. Is that not so? Where does the data race occur?

Apologies, I was wrong: I confused `k` with `t` in the assignments to `yllp, yllw` (which indeed would not make sense). Thanks for spelling it out for me. So I agree, under the assumption that `threadid()` is constant during a loop iteration, the above would be thread safe.

`threadid()` is in general not constant during an iteration. That is what the linked blog post talks about.
Also see the (extended) help for `??Threads.@threads`:

``````  The @threads macro executes the loop body in an unspecified order and potentially concurrently. It does not
specify the exact assignments of the tasks and the worker threads. The assignments can be different for each
execution. The loop body code (including any code transitively called from it) must not make any assumptions
about the distribution of iterations to tasks or the worker thread in which they are executed. The loop body
for each iteration must be able to make forward progress independent of other iterations and be free from
data races. As such, invalid synchronizations across iterations may deadlock while unsynchronized memory
accesses may result in undefined behavior.

For example, the above conditions imply that:

•  The lock taken in an iteration must be released within the same iteration.

•  Communicating between iterations using blocking primitives like Channels is incorrect.

•  Write only to locations not shared across iterations (unless a lock or atomic operation is used).

•  The value of threadid() may change even within a single iteration
``````

However, one can restore constant-`threadid()` behavior by writing `Threads.@threads :static for ...`. Also from the help:

``````:static
–––––––––

:static scheduler creates one task per thread and divides the iterations equally among them, assigning each
task specifically to each thread. In particular, the value of threadid() is guaranteed to be constant within
one iteration. Specifying :static is an error if used from inside another @threads loop or from a thread
other than 1.

│ Note
│
│  :static scheduling exists for supporting transition of code written before Julia 1.3. In newly
│  written library functions, :static scheduling is discouraged because the functions using this
│  option cannot be called from arbitrary worker threads.

``````
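A minimal sketch of the `:static` scheduler in use (run with `julia -t N`; with a single thread every entry is simply 1):

``````julia
using Base.Threads

# Record which thread handled each iteration. Under :static, threadid()
# is guaranteed constant within each iteration, so each slot of `owner`
# is written exactly once by the thread that owns that iteration.
n = 8
owner = zeros(Int, n)
@threads :static for k in 1:n
    owner[k] = threadid()
end
``````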

Ooh, there’s an extended help `??`, fancy! :)

So let me get this straight: it can happen that e.g. thread 8 starts executing one of the iterations of the `@threads for ...`. It is assigned to, say, `k=1000` and sets `t=8` (`t` is a plain integer, it is not updated dynamically). Then in the middle of the computation, thread 8 stops and leaves the variables frozen, and thread 4 continues the computation on `k=1000`. Thread 4 still computes with `t=8`, since that variable was set only once. Meanwhile thread 8 is assigned to, say, `k=2000`, and sets the same `t=8` in its iteration. As threads 4 and 8 compute, they read and write the same column `Zw[:,8]`. Is this how the data race happens?

This is a catastrophe. I’m so grateful you guys pointed this out. That would’ve been a really hard bug to detect.

Using `@threads :static for ...` instead makes sure that there is no such migration? By that I mean that when thread 8 is assigned `k=1000`, it will keep calculating until it finishes, and no other thread will take over?

I am also experimenting with

``````ThreadsX.foreach
Folds.foreach
FLoops.@floop
Polyester.@batch
``````

Are those pure parallelizations, or do they also have this darn concurrency? In other words, if I substitute `@threads` with any of them, do I also get data races?

Yes.

Sorry, I don’t know much about the other packages.
At least the `FLoops.jl` package has a warning about using `threadid()` with `@reduce` in its README.

Also: The `@spawn` pattern from above does not suffer from this problem, because you ‘pin’ a `t` to a task.
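A compact sketch of that task-pinning idea (hypothetical names, not from the thread): each scratch slot is owned by exactly one task, so no `threadid()` is needed at all.

``````julia
using Base.Threads: @spawn, nthreads

# Sum a vector with one task per chunk. Task i owns partial[i],
# regardless of which OS thread ends up running it.
function chunked_sum(xs::Vector{Float64}, ntasks::Int = nthreads())
    chunks = collect(Iterators.partition(xs, cld(length(xs), ntasks)))
    partial = zeros(length(chunks))
    @sync for (i, c) in enumerate(chunks)
        @spawn begin
            partial[i] = sum(c)   # `i` is fixed for this task
        end
    end
    return sum(partial)
end
``````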
