Why does Julia's threading behave in the following way?

Here is a working example. There are two functions: Coulomb_matrix() takes a vector of Float64 numbers and returns a matrix of Float64 numbers, while test() shows the strange behaviour. Here is the full script:

using LinearAlgebra

function Coulomb_matrix(kvec::Vector{Float64})::Matrix{Float64}
    s1 = [9047.5636 * exp(-norm(kvec)) for ja in 1:10, jb in 1:10]
    cmatrix = kron(ones(Float64, 4, 4), s1)
    return cmatrix
end

function test()
    radius = 1.4
    num_points = 20
    dimension = 40
    kx_grid = collect(range(-radius/2, stop=+radius/2, length=num_points))
    ky_grid = collect(range(-radius/2, stop=+radius/2, length=num_points))

    # build the grid of k-points
    k_set = Vector{Float64}[]
    for ja in eachindex(kx_grid), jb in eachindex(ky_grid)
        push!(k_set, [kx_grid[ja], ky_grid[jb]])
    end

    l1 = zeros(ComplexF64, dimension, dimension, length(k_set))
    l2 = zeros(ComplexF64, dimension, dimension, length(k_set))
    density_matrix = ones(ComplexF64, dimension, dimension, length(k_set))

    Threads.@threads for ja in eachindex(k_set)
        for jb in eachindex(k_set)
            cmatrix = Coulomb_matrix(k_set[ja] - k_set[jb])
            l2[:, :, ja] += cmatrix .* density_matrix[:, :, jb]
        end
    end
    println("maxl2", sort(vec(abs.(l2)))[end])

    Threads.@threads for ja in eachindex(k_set)
        for jb in eachindex(k_set)
            fcmatrix = Coulomb_matrix(k_set[ja] - k_set[jb])
            l1[:, :, ja] += fcmatrix .* density_matrix[:, :, jb]
        end
    end
    println("maxl1", sort(abs.(vec(l1)))[end])

    println("l1,l2", l1 == l2)
    println(sum(abs.(l1 - l2)))

    cmatrix = Coulomb_matrix([0.0, 0.0])
    l3 = diagm(cmatrix * diag(dropdims(sum(density_matrix, dims=3), dims=3)))
end
  

I am seeing strange behaviour in test(). l1 and l2 should be equal to each other: both are three-dimensional ComplexF64 arrays, and the only difference between their computations is the name of the intermediate array returned by Coulomb_matrix(). In one case I call it cmatrix, in the other fcmatrix (below):

Threads.@threads for ja in eachindex(k_set)
    for jb in eachindex(k_set)
        cmatrix = Coulomb_matrix(k_set[ja] - k_set[jb])
        l2[:, :, ja] += cmatrix .* density_matrix[:, :, jb]
    end
end

Threads.@threads for ja in eachindex(k_set)
    for jb in eachindex(k_set)
        fcmatrix = Coulomb_matrix(k_set[ja] - k_set[jb])
        l1[:, :, ja] += fcmatrix .* density_matrix[:, :, jb]
    end
end

The strange thing is that test(), as written, reports that l1 is not equal to l2. My guess was that the threading syntax I wrote is not appropriate, i.e. that I should not thread over l2[:,:,index] and perform in-place addition (but why? I had the impression that this should be safe, since in the code below I only access one slice of l2 at a time).

Threads.@threads for ja in eachindex(k_set)
    for jb in eachindex(k_set)
        cmatrix = Coulomb_matrix(k_set[ja] - k_set[jb])
        l2[:, :, ja] += cmatrix .* density_matrix[:, :, jb]
    end
end

But the strangest thing is that if you delete the last two lines, test() reports that l1 is indeed equal to l2, which confuses me a lot. Why would this happen?

cmatrix = Coulomb_matrix([0.0, 0.0])
l3 = diagm(cmatrix * diag(dropdims(sum(density_matrix, dims=3), dims=3)))
   

Here are the important parts:

What’s going wrong here is that the cmatrix inside the Threads.@threads block has the same name as the cmatrix outside of it, so Julia lexically captures the variable cmatrix into the closure created by the @threads loop.

This makes it so that every iteration of the @threads block actually has a race condition in it when accessing cmatrix.

This is something I really dislike about Threads.@threads: it makes it very easy to write code with subtle concurrency bugs.

One way you could fix this would be to write

    Threads.@threads for ja in eachindex(k_set)
        for jb in eachindex(k_set)
            local cmatrix = Coulomb_matrix(k_set[ja] - k_set[jb])
            l2[:, :, ja] += cmatrix .* density_matrix[:, :, jb]
        end
    end

and then cmatrix will be local to the closure and there’ll be no race condition.
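
Another option, as a sketch, is to rename the assignment that comes after the loops so that cmatrix is only ever assigned inside the loop body (cmatrix0 below is just an arbitrary new name, not anything from the original code):

cmatrix0 = Coulomb_matrix([0.0, 0.0])   # renamed from `cmatrix`
l3 = diagm(cmatrix0 * diag(dropdims(sum(density_matrix, dims=3), dims=3)))

With no assignment to cmatrix left in the enclosing function scope, the cmatrix inside the loop body becomes local to each iteration, and the race disappears.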


I think the issue is due to the scoping rules. Because you (later) assign a variable cmatrix in the surrounding scope, the cmatrix within the threaded for-loop is the same variable and gets reassigned and reused by all threads, leading to race conditions.

See also Race condition caused by variable scope getting lifted from a multithreaded context · Issue #14948 · JuliaLang/julia · GitHub.

Simpler example

Consider these basic sum functions:

using Base.Threads: @spawn, nthreads

function sum1(x)
    tasks = map(1:nthreads()) do i0
        @spawn begin
            sum = 0.
            for j = i0:nthreads():length(x)
                sum += x[j]
            end
            return sum
        end
    end

    tsums = fetch.(tasks)
    sum = 0
    for i = 1:nthreads()
        sum += tsums[i]
    end
    return sum
end

function sum2(x)
    tasks = map(1:nthreads()) do i0
        @spawn begin
            tsum = 0.
            for j = i0:nthreads():length(x)
                tsum += x[j]
            end
            return tsum
        end
    end

    tsums = fetch.(tasks)
    sum = 0
    for i = 1:nthreads()
        sum += tsums[i]
    end
    return sum
end

where the only difference is that in the second case we called the per-thread sum tsum instead of sum. You’ll note that the behaviour of sum1 is non-deterministic:

julia> x = rand(1000);

julia> sum1(x)
505.8078846565804

julia> sum1(x)
502.79831723789334

julia> sum1(x)
494.28161467864845

julia> sum1(x)
494.28161467864845

julia> sum1(x)
496.9478723181834

while sum2 works as you’d expect:

julia> sum(x)
494.2816146786484

julia> sum2(x)
494.28161467864845

julia> sum2(x)
494.28161467864845

In sum1 the sum variable within the tasks is the same as outside of them, meaning that multiple threads will write to the same variable concurrently, leading to data race issues. You can fix this by renaming the variable, or explicitly declaring the inner sum variable to be local.
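
For instance, a minimal sketch of the local-declaration fix applied to sum1 (assuming the same Base.Threads imports as above; sum1_fixed is a hypothetical name, and the only functional change is the local keyword):

function sum1_fixed(x)
    tasks = map(1:nthreads()) do i0
        @spawn begin
            local sum = 0.   # `local` gives each task its own accumulator instead of capturing the outer `sum`
            for j = i0:nthreads():length(x)
                sum += x[j]
            end
            return sum
        end
    end

    tsums = fetch.(tasks)
    sum = 0.
    for i = 1:nthreads()
        sum += tsums[i]
    end
    return sum
end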

While we’re at it: l2[:, :, ja] += ... is lowered to l2[:, :, ja] = l2[:, :, ja] + ..., which allocates a copy of the submatrix l2[:, :, ja], as well as of density_matrix[:, :, jb].

You can avoid this by using views and broadcasting. Excessive allocations (and the resulting garbage collections) can be devastating for performance in parallel programs. If possible, the Coulomb matrices should also be precomputed (a sketch of that follows the snippet below).

Threads.@threads for ja in eachindex(k_set)
    for jb in eachindex(k_set)
        local cmatrix = Coulomb_matrix(k_set[ja]-k_set[jb])
        @views l2[:, :, ja] .+= cmatrix .* density_matrix[:, :, jb]
    end
end
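
One possible shape for that precomputation, purely as a sketch: it stores one matrix per (ja, jb) pair, so it only pays off if that table fits in memory, and coulomb_table is a name I made up rather than anything from the original code.

# Hypothetical precomputation: build every Coulomb matrix once, outside the
# threaded loop, so the hot loop only reads precomputed data.
coulomb_table = [Coulomb_matrix(k_set[ja] - k_set[jb])
                 for ja in eachindex(k_set), jb in eachindex(k_set)]

Threads.@threads for ja in eachindex(k_set)
    for jb in eachindex(k_set)
        @views l2[:, :, ja] .+= coulomb_table[ja, jb] .* density_matrix[:, :, jb]
    end
end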

Due to these unfortunate scoping rules, I declare everything local inside such constructs.

Hi, thanks for this reply. But I still have one point of confusion. The first time I declare the variable “cmatrix” outside the Threads loop is actually after the threads loop. Will this also make Julia treat the “cmatrix” inside the Threads loop as a global variable?

Yes.

Not a global variable; those originate in the global scope where your function definitions live. But yes, a local variable originates in the outermost local scope in which it is declared or assigned, and inner local scopes use it by default even if they are written on earlier lines; you opt out and create a new variable with an explicit local declaration. Where local variables originate is determined when the entire function is parsed, so the order of subsequent execution doesn’t matter.
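
To make that concrete, here is a tiny sketch (the function and variable names are mine, not from the code above). The closure is written before the assignment, yet both refer to one and the same local variable:

function scope_demo()
    f = () -> (captured = 1)     # assigns the enclosing function's local `captured`
    f()                          # running the closure sets that shared local to 1
    captured = captured + 10     # this later assignment is what makes `captured` a local of scope_demo
    return captured              # returns 11: the closure and the function body share one variable
end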