Why is my parallelized run time twice that of a single run?

Hi Guys,

I am running the following code to parallelize my model over 47 independent sites. In total, the parallelized code runs for 120 seconds, but a single run takes only 60 seconds… I am wondering why the parallelized code takes twice as long as a single run. Below are some details:

========================
Parallelisation code:

function parallelizeTEM!(selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, ::ThreadsParallelization)
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        coreTEM!(selected_models, space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info)
    end
    return nothing
end

However, if I benchmark a single run (I’ve tested every site individually):

using BenchmarkTools

function bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)
    for i in eachindex(space_forcing)
        println("=== Benchmarking site $i ===")
        # pull out the i-th “slice” once:
        forcing_i      = space_forcing[i]
        spinup_i       = space_spinup_forcing[i]
        output_i       = space_output[i]
        land_i         = space_land[i]

        # now benchmark a single call to coreTEM! on site i
        b = @benchmark coreTEM!(
            $selected_models,
            $forcing_i,
            $spinup_i,
            $loc_forcing_t,
            $output_i,
            $land_i,
            $tem_info
        )

        display(b)
    end
end

# Then call it:
bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)

...
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 56.936 s (0.00% GC) to evaluate,
 with a memory estimate of 25.12 KiB, over 804 allocations.
...

But if I test the whole parallelizeTEM! function, it takes about 120 seconds to finish…

julia> @btime parallelizeTEM!(
           $selected_models,
           $space_forcing,
           $space_spinup_forcing,
           $loc_forcing_t,
           $space_output,
           $space_land,
           $tem_info,
           $tem_info.run.parallelization
       )
  120.445 s (38018 allocations: 1.29 MiB)

Then I used profview to profile this code and found that half of the 120 seconds is spent in task_done_hook, wait, poptask… Why is that, and how can I reduce the total time to nearly 60 seconds? Thanks!

======
Here is the profileview figure

Stupid question: are you launching Julia with several threads?

yes (no question is stupid! haha),

julia> Threads.nthreads()
60

Is there internal multithreading in the code you’re using? How large is space_forcing? Are the individual loop iterations of roughly the same size?
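To check whether the iterations are of roughly the same size when they run concurrently (not only when benchmarked one at a time), one option is to time each iteration inside the threaded loop. This is a minimal sketch: `fake_site_work` is a hypothetical stand-in for `coreTEM!`, and the loop sizes are made up. If each site really takes about 57 s on its own, a serial loop over 47 sites would take roughly 47 × 57 s ≈ 45 minutes, so the fair comparison for the 120 s is the slowest single site under concurrent load, not one isolated run.

```julia
using Base.Threads

# Hypothetical stand-in for coreTEM!: the cost grows with the site index,
# so the iterations are deliberately unbalanced.
function fake_site_work(i)
    s = 0.0
    for k in 1:(i * 200_000)
        s += sin(k)
    end
    return s
end

function timed_threaded_loop(n)
    elapsed = zeros(n)   # seconds spent in each iteration
    results = zeros(n)   # keep the results so the work is not optimized away
    @threads for i in 1:n
        t0 = time_ns()
        results[i] = fake_site_work(i)
        elapsed[i] = (time_ns() - t0) / 1e9
    end
    return elapsed, results
end

timed_threaded_loop(2)   # warm-up call so compilation is not timed
elapsed, results = timed_threaded_loop(8)
println("per-iteration seconds: ", round.(elapsed; digits = 4))
println("slowest iteration:     ", maximum(elapsed), " s")
```

If the per-site times measured this way are much larger than the isolated 57 s, the sites are slowing each other down (memory bandwidth, nested threading, and so on); if one site is much slower than the rest, the loop is waiting on that one.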

try to play with these?

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export BLAS_NUM_THREADS=1
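The same idea can be tried from inside Julia. This is a small sketch that only matters if `coreTEM!` ends up calling BLAS, which is an assumption (the thread does not say). Julia's bundled OpenBLAS also reads the `OPENBLAS_NUM_THREADS` environment variable.

```julia
using LinearAlgebra

println("Julia threads:       ", Threads.nthreads())
println("BLAS threads before: ", BLAS.get_num_threads())

# Limit BLAS to one thread so it does not compete with Threads.@threads.
BLAS.set_num_threads(1)

println("BLAS threads after:  ", BLAS.get_num_threads())
```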

space_forcing is very small… it is just a dataset with 5 years of hourly data for 47 sites: a list of 47 elements, each holding 5 years of hourly data (Float32).

Is this somehow oversubscription or memory contention due to too many threads? What if you try 4 threads?
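The thread count is fixed at startup (`julia --threads 4` or `JULIA_NUM_THREADS=4`), but the number of sites running at once can be varied within one session by splitting the indices into a fixed number of chunks and spawning one task per chunk. A rough sketch, where `run_with_ntasks`, `demo`, and the dummy `work` closure are made-up names standing in for the real `coreTEM!` call:

```julia
using Base.Threads

# Run f(i) for i in 1:n using at most `ntasks` concurrent tasks.
function run_with_ntasks(f, n, ntasks)
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    @sync for chunk in chunks
        @spawn for i in chunk
            f(i)
        end
    end
    return nothing
end

function demo(n)
    results = zeros(n)
    # Dummy CPU-bound work; replace with the real per-site call.
    work(i) = (results[i] = sum(sin, 1:1_000_000) + i)
    run_with_ntasks(work, n, 1)   # warm-up so compilation is not timed
    for ntasks in (1, 2, 4)
        t = @elapsed run_with_ntasks(work, n, ntasks)
        println("ntasks = ", ntasks, ": ", round(t; digits = 4), " s")
    end
    return results
end

results = demo(8)
```

If the total time stops improving, or gets worse, beyond some task count, that points toward contention rather than scheduling overhead. Also note that with 60 threads and 47 sites at least 13 threads have nothing to do, and idle threads typically show up in a profile as wait/poptask, so that part of the flame graph may just be idle time rather than overhead.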