Hi guys,
I am running the following code to parallelize my model over 47 independent sites. In total, the parallelized code runs for about 120 seconds, but a single-site run only takes about 60 seconds. I am wondering why the parallelized code takes twice as long as a single run. Below are some details:
========================
Parallelisation code:
function parallelizeTEM!(selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, ::ThreadsParallelization)
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        coreTEM!(selected_models, space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info)
    end
    return nothing
end
However, if I benchmark a single-site run (I have tested every site this way):
using BenchmarkTools

function bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)
    for i in eachindex(space_forcing)
        println("=== Benchmarking site $i ===")
        # pull out the i-th "slice" once:
        forcing_i = space_forcing[i]
        spinup_i = space_spinup_forcing[i]
        output_i = space_output[i]
        land_i = space_land[i]
        # now benchmark a single call to coreTEM! on site i
        b = @benchmark coreTEM!(
            $selected_models,
            $forcing_i,
            $spinup_i,
            $loc_forcing_t,
            $output_i,
            $land_i,
            $tem_info
        )
        display(b)
    end
end

# Then call it:
bench_coreTEM_per_site(
    selected_models,
    space_forcing,
    space_spinup_forcing,
    loc_forcing_t,
    space_output,
    space_land,
    tem_info
)
...
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
Single result which took 56.936 s (0.00% GC) to evaluate,
with a memory estimate of 25.12 KiB, over 804 allocations.
...
But if I benchmark the whole parallelizeTEM! function, it takes about 120 seconds to finish:
julia> @btime parallelizeTEM!(
           $selected_models,
           $space_forcing,
           $space_spinup_forcing,
           $loc_forcing_t,
           $space_output,
           $space_land,
           $tem_info,
           $tem_info.run.parallelization
       )
  120.445 s (38018 allocations: 1.29 MiB)
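For reference, here is a rough back-of-the-envelope estimate of what the threaded loop could cost. It is only a sketch under assumptions that are not confirmed above: every site takes about 57 s, and Threads.@threads gives each thread one contiguous chunk of sites, so the wall time is roughly the largest chunk size times the per-site time. The helper estimated_wall_time and the thread counts tried are hypothetical.

```julia
using Base.Threads: nthreads

# Hypothetical helper: estimated wall time (seconds) if each site costs
# `t_site` seconds and the sites are split into one chunk per thread.
# cld is ceiling division, i.e. the number of sites on the busiest thread.
estimated_wall_time(nsites, nthr, t_site) = cld(nsites, nthr) * t_site

nsites = 47      # number of sites, from the post
t_site = 57.0    # approximate single-site time, from the benchmark above

# Illustrative thread counts (assumptions, not measured values)
for nthr in (1, 24, 32, 47)
    println("threads = $nthr  ->  ≈ ", estimated_wall_time(nsites, nthr, t_site), " s")
end

# Number of threads this Julia session was actually started with
println("this session: ", nthreads(), " thread(s)")
```

Under these assumptions, any thread count from 24 to 46 leaves two sites on the busiest thread, which gives about 114 s, while 47 or more threads would give about 57 s.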
Then I tried to profile this code with profview and found that about half of the 120 seconds is spent in task_done_hook, wait, and poptask. Why is that, and how can I reduce the total time to nearly 60 seconds? Thanks!
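One way to see where the waiting comes from might be to record, for each site, when it started and how long it took inside the threaded loop. Below is a minimal sketch. fake_site_run is a hypothetical stand-in for coreTEM! (it only sleeps), and timed_threaded_run is likewise an illustrative name, not part of the code above.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical stand-in for coreTEM!: pretends each site costs 0.05 s.
fake_site_run(i) = (sleep(0.05); i)

function timed_threaded_run(nsites)
    started = zeros(nsites)   # start time of each site, relative to loop start
    elapsed = zeros(nsites)   # duration of each site
    t0 = time()
    @threads for i in 1:nsites
        # Each iteration writes only to its own index, so no locking is needed.
        started[i] = time() - t0
        fake_site_run(i)
        elapsed[i] = time() - t0 - started[i]
    end
    return started, elapsed, time() - t0
end

started, elapsed, total = timed_threaded_run(8)
println("threads: ", nthreads(), ", total: ", round(total; digits=2), " s")
for i in eachindex(started)
    println("site $i: started at ", round(started[i]; digits=2),
            " s, took ", round(elapsed[i]; digits=2), " s")
end
```

If some sites only start after others have finished, the loop is running in more than one round, and threads with no remaining sites would be idle for the rest of the run.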
======
Here is the profileview figure