Different run times when parallelizing with Threads.@threads

Hi Guys,

I am running a model that uses a for-loop over 17 independent sites. The sites do not share memory with each other, but the inputs for all 17 sites are collected in a list whose elements are static arrays. I tried to parallelize the for-loop with Threads.@threads, using exactly 17 threads. However, I found quite different performance between a single run and the parallelized run. Here is the script:

function parallelizeTEM!(space_selected_models::Vector, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, ::ThreadsParallelization)
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        if haskey(tem_info, :use_space_spinup_sequence) && tem_info.use_space_spinup_sequence
            @time coreTEM!(space_selected_models[space_index], space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info, tem_info.space_spinup_sequence[space_index])
        end
    end
    return nothing
end
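
For anyone who wants to reproduce the pattern without the model, here is a minimal sketch of the same loop structure. `work!` and `timed_threaded_run!` are hypothetical stand-ins (they are not part of the model code), and the per-site wall time is stored in a vector instead of being printed from inside the threads:

```julia
using Base.Threads

# Hypothetical stand-in for coreTEM!: independent, compute-only work per site.
function work!(out::Vector{Float64}, n::Int)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    out[1] = s
    return nothing
end

# Run every site in a threaded loop and record the wall time of each call.
# Each iteration writes only to its own slot of `timings` and `outputs`.
function timed_threaded_run!(outputs::Vector{Vector{Float64}}, n::Int)
    timings = zeros(length(outputs))
    @threads for site in eachindex(outputs)
        timings[site] = @elapsed work!(outputs[site], n)
    end
    return timings
end

outputs = [zeros(1) for _ in 1:17]   # one output buffer per site
timings = timed_threaded_run!(outputs, 10^6)
println(timings)
```

As far as I understand, `@time` reads process-wide allocation counters, so the allocation counts it prints inside a threaded loop can include allocations made by other threads. `@elapsed` only measures wall time and avoids that ambiguity.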

Then

julia> Threads.nthreads()
17

julia> space_index
1

julia> @btime coreTEM!($space_selected_models[space_index], $space_forcing[space_index], $space_spinup_forcing[space_index], $loc_forcing_t, $space_output[space_index], $space_land[space_index], $tem_info, $tem_info.space_spinup_sequence[space_index]);
  21.492 s (31 allocations: 89.19 KiB)

julia> @time coreTEM!(space_selected_models[space_index], space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info, tem_info.space_spinup_sequence[space_index]);
 21.392444 seconds (29 allocations: 30.578 KiB)

julia> parallelizeTEM!(space_selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, tem_info.run.parallelization);
 38.327439 seconds (459 allocations: 277.391 KiB)
 40.077714 seconds (626 allocations: 348.469 KiB)
 40.396848 seconds (354 allocations: 267.891 KiB)
 40.542786 seconds (993 allocations: 436.422 KiB)
 41.710067 seconds (665 allocations: 340.641 KiB)
 42.730745 seconds (843 allocations: 378.094 KiB)
 44.097149 seconds (585 allocations: 314.688 KiB)
 47.181715 seconds (468 allocations: 280.781 KiB)
 47.940587 seconds (1.30 k allocations: 504.531 KiB)
 48.239538 seconds (1.04 k allocations: 429.523 KiB)
 48.585902 seconds (987 allocations: 422.234 KiB)
 48.884628 seconds (723 allocations: 352.719 KiB)
 49.047153 seconds (849 allocations: 376.125 KiB)
 49.416455 seconds (1.30 k allocations: 507.266 KiB)
 52.010567 seconds (1.48 k allocations: 545.211 KiB)
 53.424446 seconds (1.34 k allocations: 522.047 KiB)
 53.869446 seconds (1.50 k allocations: 547.492 KiB)

julia> for space_index in 1:17
       println(space_index)
       @btime coreTEM!($space_selected_models[space_index], $space_forcing[space_index], $space_spinup_forcing[space_index], $loc_forcing_t, $space_output[space_index], $space_land[space_index], $tem_info, $tem_info.space_spinup_sequence[space_index]);
       end
1
  21.164 s (31 allocations: 89.19 KiB)
2
  22.174 s (31 allocations: 89.19 KiB)
3
  20.878 s (31 allocations: 89.19 KiB)
4
  21.739 s (31 allocations: 89.19 KiB)
5
  21.431 s (31 allocations: 89.19 KiB)
6
  21.357 s (31 allocations: 89.19 KiB)
7
  21.552 s (31 allocations: 89.19 KiB)
...

You can see that a single call in the plain for-loop takes only about 21 seconds, while each call inside the threaded loop takes 38 to 54 seconds. Why is that, and how can I improve the performance? Thanks!

What is the output of

julia> versioninfo()

?
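
It may also help to report the thread counts next to it. A minimal sketch, assuming Julia 1.6 or later for `BLAS.get_num_threads`; whether the model calls BLAS at all is an assumption on my side:

```julia
using LinearAlgebra

# Number of Julia threads the session was started with.
println("Julia threads:       ", Threads.nthreads())
# Logical CPU threads seen by Julia; this can be larger than the number of physical cores.
println("Logical CPU threads: ", Sys.CPU_THREADS)
# Only relevant if the model calls BLAS: BLAS keeps its own thread pool.
println("BLAS threads:        ", BLAS.get_num_threads())
```

If 17 Julia threads are more than the machine has physical cores, the threads would share cores, and that alone could lengthen each call.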

Your function seems to allocate quite a lot. Perhaps the threads are competing for memory access, which would slow them all down compared to a single-threaded run?
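
One way to check the memory hypothesis on your machine, independent of the model: time a memory-bound kernel once on a single thread, then run one copy per thread at the same time. This is only a sketch; `stream_kernel!` and `compare_scaling` are hypothetical names, and the array size and repetition count are arbitrary choices.

```julia
using Base.Threads

# Hypothetical memory-bound kernel: streams over two large arrays.
function stream_kernel!(y::Vector{Float64}, x::Vector{Float64}, reps::Int)
    for _ in 1:reps
        @inbounds @simd for i in eachindex(x, y)
            y[i] = 2.0 * x[i] + y[i]
        end
    end
    return nothing
end

# Time one kernel call alone, then one call per thread running concurrently.
function compare_scaling(; n = 2^22, reps = 20)
    nt = nthreads()
    xs = [rand(n) for _ in 1:nt]    # separate input per thread
    ys = [zeros(n) for _ in 1:nt]   # separate output per thread
    t_single = @elapsed stream_kernel!(ys[1], xs[1], reps)
    t_all = @elapsed begin
        @threads for k in 1:nt
            stream_kernel!(ys[k], xs[k], reps)
        end
    end
    return (; n_threads = nt, t_single, t_all, ratio = t_all / t_single)
end

compare_scaling(n = 2^10, reps = 1)   # small warm-up run to exclude compilation time
result = compare_scaling()
println(result)
```

A `ratio` close to 1 would mean the concurrent copies do not slow each other down. A ratio well above 1 would suggest the threads are limited by shared memory bandwidth or by sharing cores, which would be consistent with the 21 s versus 38 to 54 s you measured.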