Threads.@threads gives unexpected results

Hi, I am trying to optimize a simple function. I have written both a vectorized and a devectorized version of the function to see which one is faster. Here is the code:
Vectorized:

function funvec(m::Int64, data::Array{Float64,2})
    a, b = data[:, 1], data[:, 2]
    M = Int(length(data[:, 1]))
    n = Int(m * (m + 1) / 2)
    u = zeros(M, n)
    v = zero(u)
    k = 1
    for i in 1:m
        for j in 1:i
            u[:, k] .= @. a^(i - j) * b^(j - 1)
            if (i - j) != 0
                v[:, k] .= @. (i - j) * a^(i - j - 1) * b^(j - 1)
            end
            k = k + 1
        end
    end
    return u, v
end

Devectorized:


function fundevec(m::Int64, data::Array{Float64,2})
    a, b = data[:, 1], data[:, 2]
    M = Int(length(data[:, 1]))
    n = Int(m * (m + 1) / 2)
    u = zeros(M, n)
    v = zero(u)
    k = 1
    for i in 1:m
        for j in 1:i
            Threads.@threads for s in 1:M
                u[s, k] = a[s]^(i - j) * b[s]^(j - 1)
                if (i - j) != 0
                    v[s, k] = (i - j) * a[s]^(i - j - 1) * b[s]^(j - 1)
                end
            end
            k = k + 1
        end
    end
    return u, v
end

Call them as

using BenchmarkTools

const m, N = 200, 200

data = rand(N, 2)
@benchmark funvec(m, data)

@benchmark fundevec(m, data)

I have two questions.

1-) The above two functions give exactly the same results, and fundevec is currently the faster one. To make funvec faster, I modified it by adding Threads.@threads in front of the j loop, as follows:

function funvec(m::Int64, data::Array{Float64,2})
    a, b = data[:, 1], data[:, 2]
    M = Int(length(data[:, 1]))
    n = Int(m * (m + 1) / 2)
    u = zeros(M, n)
    v = zero(u)
    k = 1
    for i in 1:m
        Threads.@threads for j in 1:i
            u[:, k] .= @. a^(i - j) * b^(j - 1)
            if (i - j) != 0
                v[:, k] .= @. (i - j) * a^(i - j - 1) * b^(j - 1)
            end
            k = k + 1
        end
    end
    return u, v
end

But in this case I cannot get the same results as with fundevec. Can you explain why I get different results when I add Threads.@threads?

2-) Can I optimize either of these two functions further?

Thank you for your help.

One thought: that k = k + 1 may not be synchronised between threads.


This k is a function of i, j which you can work out in closed form, something like j + (i - 1) * i ÷ 2. Once you have that, it should be safe to thread the outermost loop, since iterations at different i never write into the same place.
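
For reference, a minimal sketch of that idea (the name funvec_threaded is just illustrative, and I haven't benchmarked this):

function funvec_threaded(m::Int64, data::Array{Float64,2})
    a, b = data[:, 1], data[:, 2]
    M = size(data, 1)
    n = m * (m + 1) ÷ 2
    u = zeros(M, n)
    v = zero(u)
    Threads.@threads for i in 1:m        # thread the outermost loop only
        for j in 1:i
            k = (i - 1) * i ÷ 2 + j      # closed-form column index, no shared counter
            u[:, k] .= @. a^(i - j) * b^(j - 1)
            if i != j
                v[:, k] .= @. (i - j) * a^(i - j - 1) * b^(j - 1)
            end
        end
    end
    return u, v
end

Each (i, j) pair now writes its own column k, so no two threads touch the same memory and the racy shared counter disappears.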

However, Threads.@threads divides the iterations equally, even though the work per i grows with i; it would probably be better to use Threads.@spawn to make some number of chunks which can be queued up. Something like ThreadsX.foreach may automate this for you.
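
A rough sketch of the chunking idea (the flattened pair list, the chunksize, and the fillcol! helper are just illustrative choices, not a tested recipe):

# hypothetical helper: fill column k of u and v for one (i, j) pair
function fillcol!(u, v, a, b, i, j)
    k = (i - 1) * i ÷ 2 + j
    u[:, k] .= @. a^(i - j) * b^(j - 1)
    if i != j
        v[:, k] .= @. (i - j) * a^(i - j - 1) * b^(j - 1)
    end
    return nothing
end

function funvec_spawn(m, data; chunksize = 16)
    a, b = data[:, 1], data[:, 2]
    u = zeros(size(data, 1), m * (m + 1) ÷ 2)
    v = zero(u)
    pairs = [(i, j) for i in 1:m for j in 1:i]          # all column indices, flattened
    @sync for chunk in Iterators.partition(pairs, chunksize)
        Threads.@spawn foreach(p -> fillcol!(u, v, a, b, p...), chunk)
    end
    return u, v
end

Because the pairs are flattened before chunking, every chunk does roughly the same amount of work, unlike chunking over whole values of i.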

Besides threading, your loops probably want @inbounds, and avoiding a branch via v[s,k] = ifelse(i==j, 0.0, (i-j)*a[s]^(i-j-1)*b[s]^(j-1)) might help. LoopVectorization.@avx might also speed things up.
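
Applied to fundevec, that might look something like this (untested sketch; note that ifelse evaluates both branches, so it relies on a[s] != 0 when i == j, which holds here since data comes from rand):

function fundevec2(m::Int64, data::Array{Float64,2})
    a, b = data[:, 1], data[:, 2]
    M = size(data, 1)
    u = zeros(M, m * (m + 1) ÷ 2)
    v = zero(u)
    k = 1
    for i in 1:m, j in 1:i
        Threads.@threads for s in 1:M
            @inbounds begin
                u[s, k] = a[s]^(i - j) * b[s]^(j - 1)
                # branch-free: write 0.0 rather than skipping the store when i == j
                v[s, k] = ifelse(i == j, 0.0, (i - j) * a[s]^(i - j - 1) * b[s]^(j - 1))
            end
        end
        k += 1      # incremented in the serial outer loop, so no race
    end
    return u, v
end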

You might also be able to re-write things to accumulate these powers, instead of calling ^. You always have b[s]^(j-1), so you could start at 1.0 and multiply by b[s] at each step. But you'd have to figure out where to store it, and possibly re-order the loops…
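
One way that accumulation could look, as a sketch (the power tables apow/bpow and the loop order are my own choices; this is just one way to avoid the repeated ^ calls):

function funvec_pow(m, data)
    a, b = data[:, 1], data[:, 2]
    M = size(data, 1)
    u = zeros(M, m * (m + 1) ÷ 2)
    v = zero(u)
    # running products instead of ^ : apow[s, p] == a[s]^(p - 1), likewise bpow
    apow = ones(M, m)
    bpow = ones(M, m)
    for p in 2:m, s in 1:M
        apow[s, p] = apow[s, p - 1] * a[s]
        bpow[s, p] = bpow[s, p - 1] * b[s]
    end
    k = 1
    for i in 1:m, j in 1:i
        for s in 1:M
            u[s, k] = apow[s, i - j + 1] * bpow[s, j]        # a^(i-j) * b^(j-1)
            if i != j
                v[s, k] = (i - j) * apow[s, i - j] * bpow[s, j]
            end
        end
        k += 1
    end
    return u, v
end

This trades the repeated ^ calls for two small lookup tables, and it can be combined with the threading above since the tables are read-only inside the main loop.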

Thank you for your suggestions.