Incorrect output when using Base.Threads

I am trying to write code that has a computationally expensive triply nested loop, and to parallelize the outermost loop using the @threads macro from Base.Threads. The loop body calls two functions, GFC and GHT, which take many matrices as parameters and return complex numbers. Below is the main part of my code, which deals with the allocation of variables and the loop:

using Base.Threads, LinearAlgebra, DelimitedFiles

function calc()
    Δ = 2.02 # in eV
    dd = 3                               # Change this to modify number of points
    t = collect(range(-5e-12, stop=5e-12, length=dd))
    x =  zeros(ComplexF64,dd)
    yr =  zeros(Float64,dd)
    yi =  zeros(Float64,dd)
    Temp=300
    kT = Temp * 1.380649e-23             # Boltzmann constant times temperature in Joules
    kT = kT * 2.29371227840e17           # in atomic units
    b = 1/kT
    Ωf = Matrix(Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))) * 4.55633e-6   # Final state frequencies in atomic units
    Ωi = Matrix(Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))) * 4.55633e-6   # Initial state frequencies in atomic units
    sysize = size(Ωi)[1]
    P = zeros(Float64,sysize,sysize)
    for i in 1:sysize
        P[i,i] = (2*sinh(b*Ωi[i,i]/2))
    end
    T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    T = T*T';
    D = readdlm("d.txt", Float64); # Displacement Matrix
    J = readdlm("j.txt", Float64); # Duschinsky Matrix
    Sf = zeros(ComplexF64,sysize,sysize)
    Si = zeros(ComplexF64,sysize,sysize)
    Bf = zeros(ComplexF64,sysize,sysize)
    Bi = zeros(ComplexF64,sysize,sysize)
    X11 = zeros(ComplexF64,sysize,sysize)
    X12 = zeros(ComplexF64,sysize,sysize)
    X21 = zeros(ComplexF64,sysize,sysize)
    X22 = zeros(ComplexF64,sysize,sysize)
    Y1 = zeros(ComplexF64,sysize,1)
    Y2 = zeros(ComplexF64,sysize,1)
    t[Int64((dd+1)/2)] = 10e-25   # replace the midpoint (t = 0) with a tiny nonzero value


    ################## Nested Loop Over Time ###################################
    @threads for i in 1:dd
        # println("ITERATION NUMBER = $i, started by thread $(Threads.threadid())")
        s=0
        println("The value of s is $s")
        gfcpart=GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
        println("THE VALUE OF GFC is $gfcpart calculated by thread $(Threads.threadid()), t = $(t[i])")
        for m in 1:sysize
            for mp in 1:sysize
                s = s + GHT(t[i],b,sysize,Ωf,Ωi,J,D,m,mp,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2)[1]*gfcpart*T[m,mp]
            end
        end
        x[i]=s
        println("G(t) AT TIME t = $(t[i]) is calculated by thread $(Threads.threadid()) = $(x[i])")
        # println("ITERATION NUMBER = $i, ended by thread $(Threads.threadid())")
        # println("THE VALUE OF S is $s calculated by thread $(Threads.threadid())")
    end
    @threads for i in 1:dd
        yr[i] = real(x[i])
        yi[i] = imag(x[i])
    end
end

I have used multiple println() statements to figure out where the problem lies. Whenever I run it with a single thread I get the correct output. But whenever I use more than one thread, I get a different wrong output every time.

I know that this behaviour is a classic symptom of incorrect parallelization, yet each iteration over i is independent of the others, so there should be no problem. I also tried declaring x, yr, and yi as SharedArrays, but that did not help. From the println() statements I concluded that the variable gfcpart is being computed incorrectly when more than one thread is used.

Any help towards what I have to do to correct this problem is greatly appreciated.
PS - I have also tried to parallelize it using Distributed, declaring the commonly accessed arrays as SharedArrays and distributing the outer loop with the @sync @distributed macros. That reproduced the correct serial output every time. I am trying to do the same with Threads because, as per the Julia manual, it has the least overhead, and I plan to run my code on a single machine.
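For reference, a minimal sketch of that Distributed variant (the worker count is arbitrary, and compute_point is a hypothetical per-point helper; GFC, GHT, and the input matrices would need to be defined @everywhere). It avoids the wrong answers because every worker process receives its own copy of the scratch arrays:

using Distributed
addprocs(4)                      # arbitrary worker count
@everywhere using SharedArrays   # compute_point etc. must also be defined @everywhere

x = SharedArray{ComplexF64}(dd)  # shared output, visible to all workers
@sync @distributed for i in 1:dd
    # Each worker holds its own copies of Si, Sf, etc., so mutation
    # inside GFC/GHT cannot race across workers.
    x[i] = compute_point(i)      # hypothetical per-point computation
end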

Your code suggests that the GFC function uses its arguments Si, Sf, etc. as scratch space, i.e. it mutates these arguments. Since all threads share the same arrays, that is a race condition.
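A minimal sketch of that failure mode (f! and scratch here are stand-ins for GFC and the Si/Sf arrays):

using Base.Threads

scratch = zeros(100)   # one buffer shared by every thread, as in the question

function f!(buf, x)
    buf .= x           # thread A's values can be overwritten by thread B here
    sleep(0.001)       # widen the race window for demonstration
    return sum(buf)    # may sum a mixture of A's and B's data
end

results = zeros(4)
@threads for i in 1:4
    results[i] = f!(scratch, i)   # racy: all iterations share `scratch`
end
# With one thread, results == [100, 200, 300, 400];
# with more, results is nondeterministic.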


Picking up on what @foobar_lv2 said: if GFC and GHT modify some of their inputs, then different threads interfere with each other. In your example the loop you want to parallelize only runs 3 times, but I assume that in reality it will run many more.

The easiest remedy for the race condition is to split the workload into chunks and preallocate a separate copy of everything that gets mutated for each chunk. This can be done quite conveniently with ChunkSplitters.jl. Here is the outline (a filled-in sketch follows after it):

using ChunkSplitters
function calc()
    # define the variables, do precomputations, allocate output arrays

    ################## Nested Loop Over Time ###################################
    nchunks = Threads.nthreads() # define number of chunks e.g. equal to number of threads
    # @sync tells Julia to wait for any tasks started inside it
    @sync for (xrange, ichunk) in chunks(1:dd, nchunks)
        # xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
        Threads.@spawn begin # tells Julia to start a new task running the contents of the begin block
            # now preallocate the workspaces here
            for i in xrange
                # rest of code for the computation
                # ...
                
            end # loop
        end # thread work load
    end # loop over chunks, Julia will wait until all task are completed
end
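And here is the same outline filled in with the arrays from the question. This is a sketch, assuming GFC and GHT mutate only the scratch arrays passed to them; it uses the chunks call signature from the outline above, so check the ChunkSplitters docs for the signature of your installed version:

using ChunkSplitters, Base.Threads

# Inside calc(), after the precomputations, replace the @threads loop with:
nchunks = nthreads()
@sync for (xrange, ichunk) in chunks(1:dd, nchunks)
    Threads.@spawn begin
        # Per-task scratch arrays: no other task ever touches these,
        # so GFC/GHT mutating them is now harmless.
        Sf  = zeros(ComplexF64, sysize, sysize)
        Si  = zeros(ComplexF64, sysize, sysize)
        Bf  = zeros(ComplexF64, sysize, sysize)
        Bi  = zeros(ComplexF64, sysize, sysize)
        X11 = zeros(ComplexF64, sysize, sysize)
        X12 = zeros(ComplexF64, sysize, sysize)
        X21 = zeros(ComplexF64, sysize, sysize)
        X22 = zeros(ComplexF64, sysize, sysize)
        Y1  = zeros(ComplexF64, sysize, 1)
        Y2  = zeros(ComplexF64, sysize, 1)
        for i in xrange
            s = zero(ComplexF64)
            gfcpart = GFC(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)[1]
            for m in 1:sysize, mp in 1:sysize
                s += GHT(t[i], b, sysize, Ωf, Ωi, J, D, m, mp,
                         Si, Sf, Bi, Bf, X11, X12, X21, X22, Y1, Y2)[1] * gfcpart * T[m, mp]
            end
            x[i] = s   # writing to distinct indices of the shared output is safe
        end
    end
end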

Also, consider that using @threads is no longer considered the best way to parallelize code in Julia; task-based approaches built on Threads.@spawn, like the chunked version above, generally compose better.
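For example, the simplest task-based pattern looks like this (compute_point is a hypothetical helper that allocates its own scratch arrays and returns the value of x[i] for one time point):

# One task per time point; each task owns its scratch, so nothing races.
tasks = [Threads.@spawn compute_point(i) for i in 1:dd]
x = fetch.(tasks)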
