What is the mechanics of `copy` in the thread-parallization by `Threads.@spawn`? Or: Problems with copy arrays in thread parallelism

I hope to implement a thread-parallel sample, a common example in solving time-dependent evolution problems (e.g. main_serial in the code.

Generate a matrix arr_task over time. This matrix is updated over loop/time and the matrix arr_task generated by the current loop is dependent on the result of arr_task generated by the last loop/time, so it must be generated serially. Then pass arr_task into process!() to be processed, and the result of the process!() is stored in matrix arr _stored. These two steps give the result of processing the arr_task at each time.

The main_serial in the code is a serial program. For the sake of parallelism I defined a function main_parallel_copy where I tried to use copy to pass arr_task into process!() for processing based on multithreads. However the result is different from the serial program, it seems that there is a data race (presumed based on the output after I run it). With the help of julia Chinese group, I assigned arr_task to arr_temp: arr_temp = copy(arr_task), then passed arr_temp into process!() as shown in the function main_parallel_assign.

Running environment:

win10: Microsoft Windows [Version 10.0.17763.4252]
julia Version 1.8.5
How to run: julia -t4 test.jl
This is the code from the test.jl script:

function main_serial()
    arr_stored = zeros(Int, 4)
    arr_task = zeros(Int, 1)
    for time in 1:4
        # get a task, should be serial
        arr_task[1] = time + arr_task[1]
        sleep(1)

        # handle the task
        process!(arr_stored, arr_task, time)
    end
    println("serial : $arr_stored")
end

function main_parallel_copy()
    arr_stored = zeros(Int, 4)
    arr_task = zeros(Int, 1)
    @sync for time in 1:4
        # get a task, should be serial
        # println("copy!")
        arr_task[1] = time + arr_task[1]
        sleep(1)

        # handle the task, but parallel on threads
        Threads.@spawn process!(arr_stored, copy(arr_task), time)
    end
    println("parallel copy : $arr_stored")
end

function main_parallel_assign()
    arr_stored = zeros(Int, 4)
    arr_task = zeros(Int, 1)
    @sync for time in 1:4
        # get a task, should be serial
        arr_task[1] = time + arr_task[1]
        sleep(1)
        arr_temp = copy(arr_task)

        # handle the task, but parallel on threads
        Threads.@spawn process!(arr_stored, arr_temp, time)
    end
    println("parallel assign : $arr_stored")
end

function process!(stored, task, t)
    # time of processing
    @time begin a = rand(100,100)
        [exp(a) for i in 1:100]
    end
    stored[t] = task[1]
end

@time main_serial()
println()
@time main_parallel_copy()
println()
@time main_parallel_assign()

These are the output:

  2.040268 seconds (1.50 k allocations: 46.069 MiB, 2.20% gc time)
  2.136096 seconds (1.50 k allocations: 46.069 MiB, 1.82% gc time)
  2.072491 seconds (1.50 k allocations: 46.069 MiB, 1.06% gc time)
  2.073747 seconds (1.50 k allocations: 46.069 MiB, 0.64% gc time)
serial : [1, 3, 6, 10]
 12.372814 seconds (6.39 k allocations: 184.297 MiB, 0.96% gc time, 0.32% compilation time)

  2.692697 seconds (2.70 k allocations: 82.098 MiB, 1.35% gc time)
  2.884239 seconds (3.36 k allocations: 101.799 MiB, 1.26% gc time)
  2.676654 seconds (2.80 k allocations: 84.476 MiB, 1.36% gc time)
  2.124010 seconds (1.52 k allocations: 46.069 MiB, 1.87% gc time)
parallel copy : [3, 6, 10, 10]
  7.806365 seconds (7.28 k allocations: 184.345 MiB, 0.98% gc time, 0.26% compilation time)

  2.763133 seconds (2.64 k allocations: 80.335 MiB, 2.10% gc time)
  3.105436 seconds (3.35 k allocations: 101.645 MiB, 1.87% gc time)
  2.889880 seconds (2.78 k allocations: 83.787 MiB, 3.14% gc time)
  2.118461 seconds (1.52 k allocations: 46.069 MiB, 0.87% gc time)
parallel assign : [1, 3, 6, 10]
  8.015358 seconds (6.95 k allocations: 184.329 MiB, 1.36% gc time, 0.12% compilation time)

My question are:

  1. why is it that after copying an array into a function , the array can still be updated externally? See main_parallel_copy.
  2. How else can I parallelize this type of loop based on threads?
  3. In julia threads parallelization, is there any option to set variable attribute like “firstprivate” in openmp?
  1. As copy is happening in the spawned thread, it’s executing to late, i.e., the loop has already advanced to the sleep in the next iteration when the thread has started up. (Sleeping before the update to arr_task will give the right result – but does not actually fix it as the race condition is still there if for some reason the thread starts up very slowly).
  2. Make sure the copy is done before passing the data into the thread, i.e., main_parallel_assign is a correct solution.
  3. That I don’t know.

Mutable data and threading is always fun (or a recipe for disaster depending on your point of view).