Why is my parallelized running time twice that of a single run?

Hi Guys,

I am running the following code to parallelize my model runs over 47 independent sites. In total the parallelized code takes 120 seconds, while a single run only takes about 60 seconds… I am wondering why the parallelized code takes twice as long as a single run… below are some details:

========================
Parallelisation code:

function parallelizeTEM!(selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, ::ThreadsParallelization)
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        coreTEM!(selected_models, space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info)
    end
    return nothing
end

However, if I test a single run (I have benchmarked every site individually):

using BenchmarkTools

function bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)
    for i in eachindex(space_forcing)
        println("=== Benchmarking site $i ===")
        # pull out the i-th “slice” once:
        forcing_i      = space_forcing[i]
        spinup_i       = space_spinup_forcing[i]
        output_i       = space_output[i]
        land_i         = space_land[i]

        # now benchmark a single call to coreTEM! on site i
        b = @benchmark coreTEM!(
            $selected_models,
            $forcing_i,
            $spinup_i,
            $loc_forcing_t,
            $output_i,
            $land_i,
            $tem_info
        )

        display(b)
    end
end

# Then call it:
bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)

...
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 56.936 s (0.00% GC) to evaluate,
 with a memory estimate of 25.12 KiB, over 804 allocations.
...

But if I test the whole parallelizeTEM! function, it takes about 120 seconds to finish…

julia> @btime parallelizeTEM!(
           $selected_models,
           $space_forcing,
           $space_spinup_forcing,
           $loc_forcing_t,
           $space_output,
           $space_land,
           $tem_info,
           $tem_info.run.parallelization
       )
  120.445 s (38018 allocations: 1.29 MiB)
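
For reference, a rough back-of-the-envelope of what I would expect, assuming about 57 s per site from the benchmark above; the snippet below is purely illustrative arithmetic, not measured output:

# 47 independent sites at ~57 s each
total_work = 47 * 57                    # ≈ 2679 s of serial compute
ideal_wall = max(57, total_work / 47)   # with ≥ 47 truly concurrent threads, ≈ 57 s
# the measured 120 s is roughly twice this, so either fewer than 47 sites actually
# run concurrently, or each site slows down ~2x when all of them run together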

Then I used @profview to profile this code, and I found that half of the 120 seconds is spent in task_done_hook, wait, and poptask… why is that, and how can I reduce this time to nearly 60 seconds? Thanks!

======
Here is the ProfileView figure:
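
If the flamegraph is hard to read, the same profile can also be printed as a flat report grouped by thread (available since Julia 1.8), which shows directly which threads spend most of their samples in wait / poptask and which ones actually run coreTEM!. A minimal sketch, using the same call as in the @btime above:

using Profile

Profile.clear()
@profile parallelizeTEM!(selected_models, space_forcing, space_spinup_forcing,
    loc_forcing_t, space_output, space_land, tem_info, tem_info.run.parallelization)

# flat report, one block per thread, heaviest lines first
Profile.print(format = :flat, groupby = :thread, sortedby = :count)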


Stupid question: are you launching Julia with several threads?

Yes (no question is stupid! haha):

julia> Threads.nthreads()
60

Is there internal multithreading in the code you’re using? How large is space_forcing? Are the individual loop iterations of roughly the same size?

Try playing with these:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export BLAS_NUM_THREADS=1
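
If coreTEM! does any linear algebra internally, the OpenBLAS that ships with Julia can also be pinned from inside the session, which is easier to verify than environment variables (those must be set before Julia starts). A small sketch:

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(1)   # stop BLAS threads from competing with Threads.@threads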

space_forcing is very small… it's just 5 years of hourly data for 47 sites… a list of 47 elements, where each element holds 5 years of hourly data (Float32).

Is this somehow oversubscription or memory contention due to too many threads? What if you try 4 threads?
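
A quick way to check for oversubscription is to compare the Julia thread count with the hardware the session can actually see (Sys.CPU_THREADS reports logical cores, so the number of physical cores may be only half of that); run_tem.jl below is just a placeholder script name:

# start Julia with an explicit thread count, e.g.
#   julia --threads 4 run_tem.jl
# then inside the session compare against the hardware:
@show Threads.nthreads()
@show Sys.CPU_THREADS   # logical cores visible to this session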


Should be… if I switch to 47 threads, it takes 60 seconds on the login node I use for testing, the same time as running only one site… so why can't I assign more threads to it?
Another weird thing is that it averages about 100 seconds when I submit it as a Slurm job… no idea why…


My understanding of this is pretty rudimentary, but with too many threads the CPU is a lot busier scheduling all of them. Multiple threads may also start working on something and occupy memory (e.g. cache) that then has to be evicted and reloaded when other threads start or continue running the same chunk of code. If this sort of interruption happens often, the speed at which memory gets copied around might be the limiting factor here. But again, this is just speculation; I would also be interested in a better explanation :sweat_smile:
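
One way to test this would be to time each site inside the threaded loop and record which thread ran it: if every site suddenly takes ~120 s instead of ~57 s, the slowdown is contention inside coreTEM! (memory bandwidth, BLAS, GC) rather than scheduling. A sketch reusing the arguments from parallelizeTEM! above; timedTEM! and per_site_times are made-up names:

function timedTEM!(selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info)
    per_site_times = zeros(length(space_forcing))
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        # wall time of this site's run and the thread that executed it
        t = @elapsed coreTEM!(selected_models, space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info)
        per_site_times[space_index] = t
        println("site $space_index on thread $(Threads.threadid()): $(round(t; digits = 1)) s")
    end
    # the parallel wall time can never be shorter than the slowest single site
    println("slowest site: $(maximum(per_site_times)) s, total work: $(sum(per_site_times)) s")
    return per_site_times
end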

When it comes to Slurm, there are a couple of other things to consider, e.g.

  • is the node running the code via Slurm the same one you are using to test? (different hardware usually means different runtime due to different CPUs, IO, memory; the network could be busy, etc.)
  • how many CPUs were requested and how many were actually allocated? (Slurm usually counts virtual cores as individual CPUs, sometimes there is a minimum or maximum number of CPUs that can be allocated, and so on; see the sketch below)
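
For the second point, a quick check from inside the job itself; this is only a sketch, SLURM_CPUS_PER_TASK is set only when --cpus-per-task was requested, and run_tem.jl is a placeholder name:

# in the batch script, request and use the same number of CPUs, e.g.
#   #SBATCH --cpus-per-task=47
#   julia --threads $SLURM_CPUS_PER_TASK run_tem.jl
# then inside the job, compare what Slurm granted with what Julia started:
cpus = get(ENV, "SLURM_CPUS_PER_TASK", "not set")
println("Slurm CPUs per task: $cpus, Julia threads: $(Threads.nthreads())")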