Why is my parallelized running time twice that of a single run?

Hi Guys,

I am running the following code to parallelize my model runs over 47 independent sites. In total the parallelized code takes 120 seconds, while a single run only takes about 60 seconds… I am wondering why the parallelized code takes twice as long as a single run… below are some details:

========================
Parallelisation code:

function parallelizeTEM!(selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, ::ThreadsParallelization)
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        coreTEM!(selected_models, space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info)
    end
    return nothing
end

However, if I test a single run (I have benchmarked every site individually):

using BenchmarkTools

function bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)
    for i in eachindex(space_forcing)
        println("=== Benchmarking site $i ===")
        # pull out the i-th “slice” once:
        forcing_i      = space_forcing[i]
        spinup_i       = space_spinup_forcing[i]
        output_i       = space_output[i]
        land_i         = space_land[i]

        # now benchmark a single call to coreTEM! on site i
        b = @benchmark coreTEM!(
            $selected_models,
            $forcing_i,
            $spinup_i,
            $loc_forcing_t,
            $output_i,
            $land_i,
            $tem_info
        )

        display(b)
    end
end

# Then call it:
bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)

...
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 56.936 s (0.00% GC) to evaluate,
 with a memory estimate of 25.12 KiB, over 804 allocations.
...

But if I test the whole parallelizeTEM! function, it takes about 120 seconds to finish…

julia> @btime parallelizeTEM!(
           $selected_models,
           $space_forcing,
           $space_spinup_forcing,
           $loc_forcing_t,
           $space_output,
           $space_land,
           $tem_info,
           $tem_info.run.parallelization
       )
  120.445 s (38018 allocations: 1.29 MiB)
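
For reference, a rough back-of-the-envelope of what I would expect, assuming about 57 s per site from the benchmark above; the snippet below is purely illustrative arithmetic, not measured output:

# 47 independent sites at ~57 s each
total_work = 47 * 57                    # ≈ 2679 s of serial compute
ideal_wall = max(57, total_work / 47)   # with ≥ 47 truly concurrent threads, ≈ 57 s
# the measured 120 s is roughly twice this, so either fewer than 47 sites actually
# run concurrently, or each site slows down ~2x when all of them run together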

Then I used @profview to profile this code, and I found that half of the 120 seconds is spent in task_done_hook, wait, and poptask… why is that, and how can I reduce this time to nearly 60 seconds? Thanks!

======
Here is the ProfileView figure:
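
If the flamegraph is hard to read, the same profile can also be printed as a flat report grouped by thread (available since Julia 1.8), which shows directly which threads spend most of their samples in wait / poptask and which ones actually run coreTEM!. A minimal sketch, using the same call as in the @btime above:

using Profile

Profile.clear()
@profile parallelizeTEM!(selected_models, space_forcing, space_spinup_forcing,
    loc_forcing_t, space_output, space_land, tem_info, tem_info.run.parallelization)

# flat report, one block per thread, heaviest lines first
Profile.print(format = :flat, groupby = :thread, sortedby = :count)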


Stupid question: are you launching Julia with several threads?

Yes (no question is stupid! haha):

julia> Threads.nthreads()
60

Is there internal multithreading in the code you’re using? How large is space_forcing? Are the individual loop iterations of roughly the same size?

Try playing with these:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export BLAS_NUM_THREADS=1
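
If coreTEM! does any linear algebra internally, the OpenBLAS that ships with Julia can also be pinned from inside the session, which is easier to verify than environment variables (those must be set before Julia starts). A small sketch:

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(1)   # stop BLAS threads from competing with Threads.@threads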

space_forcing is very small… it's just 5 years of hourly data for 47 sites… a list of 47 elements, where each element holds 5 years of hourly data (Float32).

Is this somehow oversubscription or memory contention due to too many threads? What if you try 4 threads?
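
A quick way to check for oversubscription is to compare the Julia thread count with the hardware the session can actually see (Sys.CPU_THREADS reports logical cores, so the number of physical cores may be only half of that); run_tem.jl below is just a placeholder script name:

# start Julia with an explicit thread count, e.g.
#   julia --threads 4 run_tem.jl
# then inside the session compare against the hardware:
@show Threads.nthreads()
@show Sys.CPU_THREADS   # logical cores visible to this session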


Should be… if I switch to 47 threads, it takes 60 seconds on the login node I use for testing, the same time as running only one site… so why can't I assign more threads to it?
Another weird thing is that it averages about 100 seconds when I submit it as a Slurm job… no idea why…


My understanding of this is pretty rudimentary, but with too many threads the CPU is a lot busier scheduling all of them. Multiple threads may also start working on something and occupy memory (e.g. cache) that then has to be evicted and reloaded when other threads start or continue running the same chunk of code. If this sort of interruption happens often, the speed at which memory gets copied around might be the limiting factor here. But again, this is just speculation; I would also be interested in a better explanation :sweat_smile:
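
One way to test this would be to time each site inside the threaded loop and record which thread ran it: if every site suddenly takes ~120 s instead of ~57 s, the slowdown is contention inside coreTEM! (memory bandwidth, BLAS, GC) rather than scheduling. A sketch reusing the arguments from parallelizeTEM! above; timedTEM! and per_site_times are made-up names:

function timedTEM!(selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info)
    per_site_times = zeros(length(space_forcing))
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        # wall time of this site's run and the thread that executed it
        t = @elapsed coreTEM!(selected_models, space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info)
        per_site_times[space_index] = t
        println("site $space_index on thread $(Threads.threadid()): $(round(t; digits = 1)) s")
    end
    # the parallel wall time can never be shorter than the slowest single site
    println("slowest site: $(maximum(per_site_times)) s, total work: $(sum(per_site_times)) s")
    return per_site_times
end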

When it comes to Slurm, there are a couple of other things to consider, e.g.

  • is the node running the code via Slurm the same one you are using to test? (different hardware usually means different runtime due to different CPUs, IO, memory; the network could be busy, etc.)
  • how many CPUs were requested and how many were actually allocated? (Slurm usually counts virtual cores as individual CPUs, sometimes there is a minimum or maximum number of CPUs that can be allocated, and so on; see the sketch below)
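
For the second point, a quick check from inside the job itself; this is only a sketch, SLURM_CPUS_PER_TASK is set only when --cpus-per-task was requested, and run_tem.jl is a placeholder name:

# in the batch script, request and use the same number of CPUs, e.g.
#   #SBATCH --cpus-per-task=47
#   julia --threads $SLURM_CPUS_PER_TASK run_tem.jl
# then inside the job, compare what Slurm granted with what Julia started:
cpus = get(ENV, "SLURM_CPUS_PER_TASK", "not set")
println("Slurm CPUs per task: $cpus, Julia threads: $(Threads.nthreads())")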