How to use LoopVectorization.jl?

I have a question regarding LoopVectorization.jl. I’ve been trying to integrate it into my project but am stuck with some errors. Here is an MWE that reproduces the problem:

using LoopVectorization

function F!(xllp::Vector{Vector{Int}}, xllw::Vector{Vector{Float64}},
            yllp::Vector{Vector{Int}}, yllw::Vector{Vector{Float64}},
            Zp::Array{Int,2}, Zw::Array{Float64,2})::Nothing
    m, n = size(Zw,1), length(xllp)
    @tturbo for k = 1:n
        t = Threads.threadid()
        for i = 1:m                       # clear from previous computation
            Zw[i,t] = 0
        end
        for j = 1:length(xllp[k])         # sum weights with same position
            v, w = xllp[k][j], xllw[k][j]
            Zw[v,t] += w
        end
        i = 1
        for ii = 1:m                      # recover sorted unique positions
            Zp[i,t] = ii
            Zw[i,t] = Zw[ii,t]
            i += (Zw[ii,t] == 0 ? 0 : 1)
        end
        i -= 1
        yllp[k], yllw[k] = Zp[1:i,t], Zw[1:i,t]   # store result permanently
    end
    return nothing
end

m, n, s, nt = 10^1, 10^1, 5, Threads.nthreads()  # create toy data
xllp = [rand(1:m,rand(0:s)) for j=1:n]
xllw = [rand(Float64,length(l)) for l in xllp]
yllp, yllw = Vector{Vector{Int}}(undef,n), Vector{Vector{Float64}}(undef,n)
Zp, Zw = fill(Int(0),m,nt), fill(Float64(0),m,nt)  # temporary arrays for calculations
F!(xllp,xllw,yllp,yllw,Zp,Zw)  # run F! on toy data

If you run this, you’ll get:

ERROR: LoadError: UndefVarError: `k` not defined
Stacktrace:
 [1] F!(xllp::Vector{Vector{Int64}}, xllw::Vector{Vector{Float64}}, yllp::Vector{Vector{Int64}}, yllw::Vector{Vector{Float64}}, Zp::Matrix{Int64}, Zw::Matrix{Float64})
   @ Main E:\Dropbox\programming_Julia\Q06\Q06.jl:6
 [2] top-level scope
   @ E:\Dropbox\programming_Julia\Q06\Q06.jl:26
 [3] include(fname::String)
   @ Base.MainInclude .\client.jl:478
 [4] top-level scope
   @ REPL[10]:1
in expression starting at E:\Dropbox\programming_Julia\Q06\Q06.jl:26

However, running the same code without @tturbo works as desired. Any idea what the issue is? Is there a way to reorganize my code so that @tturbo accepts it?

I can’t tell you how to fix it, but you might be hitting a (current) limitation of LoopVectorization. From the README:

The macro assumes that loop iterations can be reordered. It also currently supports simple nested loops, where loop bounds of inner loops are constant across iterations of the outer loop, and only a single loop at each level of loop nest. These limitations should be removed in a future version.
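
For contrast, the kind of loop nest the README does advertise (roughly its matrix-multiply example, reproduced from memory, so treat it as a sketch) has constant inner-loop bounds and a single loop per nesting level:

using LoopVectorization

function mygemm!(C, A, B)
    @turbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)        # inner bounds do not depend on m or n
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end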

Also, “thread-local state”, i.e. indexing arrays by the thread id, is bound to break. Please see the PSA blog post on this pattern.


Aha, so it’s best to wait until LoopVectorization supports inner loops of different lengths. Ok.

Regarding threadid(), thank you for pointing out the danger of my pattern. However, I’m confused by your link. As I understand it, my code above will potentially give incorrect results? Then how could I achieve what this function does, with Threads.@threads instead of @tturbo?

The blog post suggests using Tasks instead of @threads. Then when is using the latter even safe, and why does @threads exist at all? I’m not sure how to use @spawn in my case, as it’s a new macro to me.

You could adopt the partition example from the linked PSA post and write something like

function F!(xllp ::Vector{Vector{Int}}, xllw ::Vector{Vector{Float64}}, yllp ::Vector{Vector{Int}}, yllw ::Vector{Vector{Float64}}, Zp ::Array{Int,2}, Zw ::Array{Float64,2}) ::Nothing
    m, n = size(Zw,1), length(xllp)
    tasks = map(1:Threads.nthreads()) do t
        Threads.@spawn begin
            # t is now fixed for this task; do this task's share of k = 1:n here
        end
    end
    foreach(wait, tasks)   # or fetch.(tasks) if you need the return values
    return nothing
end
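
For concreteness, here is one way the begin block could be filled in. This is an untested sketch: F_spawned! and the chunking via Iterators.partition are my own choices, and the per-k body is just the serial code from your MWE, so double-check it before relying on it.

using Base.Threads: @spawn

function F_spawned!(xllp, xllw, yllp, yllw, Zp, Zw)   # sketch; stand-in name
    m, n = size(Zw, 1), length(xllp)
    ntasks = size(Zw, 2)                              # one scratch column per task
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(enumerate(chunks)) do (t, ks)
        @spawn for k in ks                            # t is pinned to this task, not to a thread
            for i in 1:m                              # clear scratch column t
                Zw[i, t] = 0
            end
            for j in eachindex(xllp[k])               # sum weights with the same position
                Zw[xllp[k][j], t] += xllw[k][j]
            end
            i = 1
            for ii in 1:m                             # recover sorted unique positions
                Zp[i, t] = ii
                Zw[i, t] = Zw[ii, t]
                i += (Zw[ii, t] == 0 ? 0 : 1)
            end
            i -= 1
            yllp[k], yllw[k] = Zp[1:i, t], Zw[1:i, t]
        end
    end
    foreach(wait, tasks)
    return nothing
end

It is called exactly like your F!, e.g. F_spawned!(xllp, xllw, yllp, yllw, Zp, Zw); the point is that each task owns one column of Zp/Zw for its whole lifetime, so threadid() is not needed at all.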

However, I am not even sure that your MWE would give correct results when used with a static scheduler.
Can you explain how the @tturbo loop is supposed to work with the k index and the t thread id?

In general you can’t control which k gets assigned which t, so your loop would save the results to yllp, yllw in random ordering when the function is run multiple times (assuming static scheduling, e.g. one k for one t). [And if nthreads() > n you encounter a data race.]
(Edit: clarified in the next comment by the OP. I had just confused t and k in the last line.)

So perhaps you need to rethink your algorithm before continuing with the parallelization.

And to work around the problem with @turbo and nested loops: you can remove the @tturbo macro from the start and add a @turbo (note: only one t) in front of each inner for loop.
But since there is no crazy math going on here, I suspect that prefixing the whole loop with @inbounds might have much the same effect.
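
For the zeroing loop that placement would look roughly like this (untested; clear_scratch! is just a made-up helper to show where @turbo goes). The accumulation and compaction loops carry dependencies across their iterations (repeated positions, the running counter i), so for those I would rely on @inbounds rather than @turbo.

using LoopVectorization

function clear_scratch!(Zw, t)
    @turbo for i in axes(Zw, 1)   # simple inner loop, iterations can be reordered
        Zw[i, t] = 0
    end
    return nothing
end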

Tasks are not safer than using @threads directly. In any case you (the programmer) are responsible for avoiding race conditions, deadlocks, etc.

@fatteneder My function F! does the following: each xllp[k] contains positions (unsorted and possibly with duplicates), and each xllw[k] contains the corresponding weights. The function sorts the positions and the corresponding weights. Whenever two positions are equal, their weights are summed. The result (sorted unique positions and summed weights) is written into yllp[k] and yllw[k].

This is achieved by writing positions and weights into the arrays Zp and Zw, which are temporary; their content is irrelevant once the function call finishes. Hence it is irrelevant which k is paired with which t, as long as t does not change until the work for that k is finished.
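
In other words, for a single k the intended result is what this purely serial reference would produce (a hypothetical, allocation-heavy restatement of the spec, not my actual code):

function combine(p::Vector{Int}, w::Vector{Float64})
    acc = Dict{Int,Float64}()
    for (pj, wj) in zip(p, w)
        acc[pj] = get(acc, pj, 0.0) + wj     # sum weights sharing a position
    end
    pos = sort!(collect(keys(acc)))          # sorted unique positions
    return pos, [acc[q] for q in pos]        # ... and their summed weights
end

# roughly: yllp[k], yllw[k] == combine(xllp[k], xllw[k])
# (F! additionally drops positions whose weights sum to exactly 0, since Zw == 0 marks unused slots)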

Since the same procedure is performed for every k, I wish to parallelize it to the best of my abilities.

In general you can’t control which k gets assigned which t, so your loop would save the results to yllp, yllw in random ordering when the function is run multiple times (assuming static scheduling, e.g. one k for one t). [And if nthreads() > n you encounter a data race.]

Why would the results be stored in yllp, yllw in random order?! My understanding of the algorithm is as follows (please stop me where I am wrong):
Each thread of the CPU picks a specific k and works on it. A thread t works on the k-th array and does its computation in the preallocated space Zp[:,t], Zw[:,t]: at the start it sets all entries there to 0, then it does the aggregation, then it reads the sorted unique positions from there, and only the t-th thread reads and writes there. When it finishes, the result is written into yllp[k], yllw[k].

So e.g. thread 8 might work on xllp[3], xllw[3], compute in Zp[:,t], Zw[:,t], then write the result into yllp[3], yllw[3]. I thought that during each iteration of @threads for ..., our t does not change. Is that not so? Where does the data race occur?

My apologies, I was wrong; I had confused k with t in the assignments to yllp, yllw (which indeed would not make sense). Thanks for spelling it out for me. So I agree: under the assumption that threadid() is constant during each loop iteration, the above would be thread safe.

threadid() is in general not constant during an iteration. That is what the linked blog post talks about.
Also see the (extended) help for ??Threads.@threads:

  The @threads macro executes the loop body in an unspecified order and potentially concurrently. It does not
  specify the exact assignments of the tasks and the worker threads. The assignments can be different for each
  execution. The loop body code (including any code transitively called from it) must not make any assumptions
  about the distribution of iterations to tasks or the worker thread in which they are executed. The loop body
  for each iteration must be able to make forward progress independent of other iterations and be free from
  data races. As such, invalid synchronizations across iterations may deadlock while unsynchronized memory
  accesses may result in undefined behavior.

  For example, the above conditions imply that:

    •  The lock taken in an iteration must be released within the same iteration.

    •  Communicating between iterations using blocking primitives like Channels is incorrect.

    •  Write only to locations not shared across iterations (unless a lock or atomic operation is used).

    •  The value of threadid() may change even within a single iteration.
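
A tiny experiment to check the last point; whether it actually reports any migrations depends on the machine, the Julia version and luck, but nothing forbids them:

migrated = Threads.Atomic{Int}(0)
Threads.@threads for k in 1:100
    t1 = Threads.threadid()
    sleep(0.01)                  # yield point: the task may be moved to another thread
    t2 = Threads.threadid()
    t1 == t2 || Threads.atomic_add!(migrated, 1)
end
@show migrated[]                 # > 0 means some iteration changed threads mid-way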

However, one can restore the guarantee of a constant threadid() by writing Threads.@threads :static for .... Also from the help:

  :static
  –––––––––

  :static scheduler creates one task per thread and divides the iterations equally among them, assigning each
  task specifically to each thread. In particular, the value of threadid() is guaranteed to be constant within
  one iteration. Specifying :static is an error if used from inside another @threads loop or from a thread
  other than 1.

  │ Note
  │
  │  :static scheduling exists for supporting transition of code written before Julia 1.3. In newly
  │  written library functions, :static scheduling is discouraged because the functions using this
  │  option cannot be called from arbitrary worker threads.
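
Applied to the variables from your MWE, a minimal illustration of that guarantee (not the full algorithm, just the threadid() part):

Threads.@threads :static for k = 1:n
    t = Threads.threadid()   # with :static, t stays fixed for this whole iteration
    Zw[1, t] = k             # so each iteration only ever touches its own column
end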

Ooh, there’s an extended help ??, fancy! :)

So let me get this straight: it can happen that e.g. thread 8 starts executing one of the iterations of the @threads for .... Hence it is assigned to e.g. k=1000 and sets t=8; t is an ordinary integer, it is not dynamically updated afterwards. Then in the middle of the computation, thread 8 stops and leaves the variables frozen, and thread 4 continues the computation for k=1000. Thread 4 still computes with t=8, since that variable was set only once. Thread 8 is then assigned to e.g. k=2000, but creates the same t=8 in its iteration. As threads 4 and 8 compute, they read and write to the same column Zw[:,8]. Is this how the data race happens?

This is a catastrophe. I’m so grateful you guys pointed this out. That would’ve been a really hard bug to detect.

Does using @threads :static for ... instead make sure that there is no such concurrency? By that I mean that when thread 8 is assigned k=1000, it will keep calculating until it finishes, and no other thread will take over?

I am also experimenting with

ThreadsX.foreach 
Folds.foreach
FLoops.@floop
Polyester.@batch

Are those pure parallelizations, or do they also have this darn concurrency issue? In other words, if I substitute any of them for @threads, can I also get data races?

Yes.

Sorry, I don’t know much about the other packages.
At least the FLoops.jl package has a warning about using threadid() with @reduce in its README.

Also: The @spawn pattern from above does not suffer from this problem, because you ‘pin’ a t to a task.
