Probable data race condition causing problems when trying to parallelize a loop used to populate an array

The issue is not a race condition; at least I didn't see one on a quick read. You have misunderstood the fundamental difference between using processes (via Distributed.jl) and threads. You cannot just slap @distributed on the code and expect it to work — you need to think about how data moves between workers.

Let me try to explain:
Workers have totally different memory spaces on your machine; for all practical purposes they could live on physically different machines. So when you preallocate the X123 matrices outside the loop, this happens on the main process. When you access a variable inside @distributed (line (1)) that is declared outside, Distributed.jl copies it to the worker process and the write goes into that local copy. That is why there is no data race, but the information stays local to the worker. It also means that line (3) does not move data back to the main process: the empty array x is copied to each worker, each worker writes into its local version, and that version is garbage collected because it is never transported back.
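To make that concrete, here is a minimal sketch of the losing pattern (the array name `x`, the size, and the `i^2` body are just stand-ins for your actual code):

```julia
using Distributed
addprocs(4)

x = zeros(100)   # allocated on the main process

@sync @distributed for i in 1:100
    # `x` is serialized to each worker; this write goes into the
    # worker's private copy, not into the array on the main process
    x[i] = i^2
end

sum(x)   # still 0.0 — the workers' copies were thrown away
```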

So what should you do (a rough sketch follows after the list):
1.) Partition your tasks (e.g. using ChunkSplitters.jl) into as many chunks as you have workers
2.) Use @distributed to distribute the chunks
3.) Preallocate the necessary arrays X123 on each worker
4.) Perform the workload on each worker
5.) Write the results to a SharedArray from SharedArrays.jl
6.) If there are large constant arrays used as input for the calculations, then consider using a SharedArray for that as well to avoid copying it to every worker.
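Here is roughly what that could look like. Everything in this sketch is an assumption about your setup: `N`, the 3×3 `X1` buffer and the `fill!`/`sum` body are placeholders for your real X123 matrices and algebra, and I'm using the `chunks(itr; n)` keyword form from ChunkSplitters — check the API of the version you have installed.

```julia
using Distributed
addprocs(4)

@everywhere using SharedArrays
using ChunkSplitters

N = 1_000
result = SharedArray{Float64}(N)             # (5) shared with all local workers

# (1) one chunk of indices per worker
index_ranges = collect(chunks(1:N; n = nworkers()))

# (2) distribute the chunks over the workers
@sync @distributed for inds in index_ranges
    # (3) preallocate the work buffers once per chunk, on the worker itself
    X1 = zeros(3, 3)
    for i in inds
        # (4) the actual workload (placeholder computation)
        fill!(X1, i)
        # (5) write into the SharedArray; the main process sees this write
        result[i] = sum(X1)
    end
end

sum(result)   # the results are now available on the main process
```

The same idea covers step 6: if the computation reads a large constant array, make it a SharedArray as well (or define it once per worker with @everywhere) instead of letting it be copied into every closure.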

Side note:
I think you could likely optimize your algebra code quite a bit, and I thought you already got some input on that in an earlier thread here. Did you perhaps base this on an older version of your code?
