Different performance when parallelizing a loop with Threads.@threads

Hi Guys,

I am running a model that uses a for-loop over 17 independent sites. The sites do not share any memory with each other, but all of the inputs for the 17 sites are collected in a list whose elements are static arrays. I tried to use Threads.@threads to parallelize the for-loop, using exactly 17 threads. However, I found quite a performance difference between a single run and the parallelized run. Here is the script:

function parallelizeTEM!(space_selected_models::Vector, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, ::ThreadsParallelization)
    Threads.@threads for space_index ∈ eachindex(space_forcing)
        if haskey(tem_info, :use_space_spinup_sequence) && tem_info.use_space_spinup_sequence
            @time coreTEM!(space_selected_models[space_index], space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info, tem_info.space_spinup_sequence[space_index])
        end
    end
    return nothing
end

Then

julia> Threads.nthreads()
17

julia> space_index
1

julia> @btime coreTEM!($space_selected_models[space_index], $space_forcing[space_index], $space_spinup_forcing[space_index], $loc_forcing_t, $space_output[space_index], $space_land[space_index], $tem_info, $tem_info.space_spinup_sequence[space_index]);
  21.492 s (31 allocations: 89.19 KiB)

julia> @time coreTEM!(space_selected_models[space_index], space_forcing[space_index], space_spinup_forcing[space_index], loc_forcing_t, space_output[space_index], space_land[space_index], tem_info, tem_info.space_spinup_sequence[space_index]);
 21.392444 seconds (29 allocations: 30.578 KiB)

julia> parallelizeTEM!(space_selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, space_land, tem_info, tem_info.run.parallelization);
 38.327439 seconds (459 allocations: 277.391 KiB)
 40.077714 seconds (626 allocations: 348.469 KiB)
 40.396848 seconds (354 allocations: 267.891 KiB)
 40.542786 seconds (993 allocations: 436.422 KiB)
 41.710067 seconds (665 allocations: 340.641 KiB)
 42.730745 seconds (843 allocations: 378.094 KiB)
 44.097149 seconds (585 allocations: 314.688 KiB)
 47.181715 seconds (468 allocations: 280.781 KiB)
 47.940587 seconds (1.30 k allocations: 504.531 KiB)
 48.239538 seconds (1.04 k allocations: 429.523 KiB)
 48.585902 seconds (987 allocations: 422.234 KiB)
 48.884628 seconds (723 allocations: 352.719 KiB)
 49.047153 seconds (849 allocations: 376.125 KiB)
 49.416455 seconds (1.30 k allocations: 507.266 KiB)
 52.010567 seconds (1.48 k allocations: 545.211 KiB)
 53.424446 seconds (1.34 k allocations: 522.047 KiB)
 53.869446 seconds (1.50 k allocations: 547.492 KiB)

julia> for space_index in 1:17
       println(space_index)
       @btime coreTEM!($space_selected_models[space_index], $space_forcing[space_index], $space_spinup_forcing[space_index], $loc_forcing_t, $space_output[space_index], $space_land[space_index], $tem_info, $tem_info.space_spinup_sequence[space_index]);
       end
1
  21.164 s (31 allocations: 89.19 KiB)
2
  22.174 s (31 allocations: 89.19 KiB)
3
  20.878 s (31 allocations: 89.19 KiB)
4
  21.739 s (31 allocations: 89.19 KiB)
5
  21.431 s (31 allocations: 89.19 KiB)
6
  21.357 s (31 allocations: 89.19 KiB)
7
  21.552 s (31 allocations: 89.19 KiB)
...

You can see that a single iteration of the for-loop takes only about 21 seconds, while each iteration inside the threaded loop takes 38–54 seconds… why is that, and how can I improve the performance? Thanks!

What is the output of

julia> versioninfo()

?

Your function seems to allocate quite a lot; perhaps the threads are competing for memory access, which would slow them all down compared to a single-threaded run?

Yes, it does. But we are running it 24365400 times… so I think the allocations might be small compared to the number of loop iterations?

Julia Version 1.11.5

Sorry, you did not answer my question. I asked for the output of the function versioninfo(), mainly because it also gives info about the CPU and the OS that you are using.

Ah sorry, I thought you just wanted the first line of the output… here is the full message:

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 10 × Apple M4
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 17 default, 0 interactive, 4 GC (on 4 virtual cores)

The first thing I notice is that you have a 10-core CPU, but use 17 threads. This is normally a bad choice for running Julia programs. In most cases, I would not use more threads than the number of performance cores. The M4 has probably just 4 performance cores, depending on the version of the M4 you have.

If you run the M4 using performance plus efficiency cores, then most of the threads will run on the efficiency cores at 2.89 GHz. If you run only one thread, it will probably run on a performance core at 4.46 GHz.

This difference might already improve the single-threaded performance by a factor of 1.54.

Conclusion:

  • Start Julia with julia -t 4 or julia -t 10, depending on whether you want to use only the performance cores or all cores
  • If you use 10 cores, expect that the performance per thread is by a factor of 1.54 lower due to the lower clock speed of the efficiency cores
  • In addition, check how much of the time is used by the garbage collector. You can also try to run the garbage collector on multiple cores. For best performance, try to reduce memory allocations as much as possible.
  • Because you have an Apple CPU, the memory bandwidth is most likely less of a problem. On AMD or Intel CPUs, this can cause a lot of additional slowdown for multithreaded programs.
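As a quick check of the GC point, here is a minimal sketch using Base's @timed; the allocating workload below is just a placeholder, not your model:

```julia
# Minimal sketch (placeholder workload, not coreTEM!): measure what
# fraction of the runtime is spent in garbage collection.
stats = @timed begin
    s = 0.0
    for _ in 1:1_000
        s += sum(rand(10_000))   # allocates a fresh array each iteration
    end
    s
end
println("GC fraction of runtime: ", stats.gctime / stats.time)
```

If that fraction is large in the threaded run, reducing allocations (or starting Julia with more GC threads) will likely help more than anything else.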

Thanks for your reply! I have now moved the model to an HPC cluster (using srun --pty -p big -t 08:00:00 --mem=100G -c 17 bash -l to launch an interactive node for testing), and the results are similar… below is the information about my HPC node:

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 17 default, 0 interactive, 8 GC (on 128 virtual cores)
Environment:
  JULIA_PKG_PRECOMPILE_AUTO = 0
  JULIA_DEPOT_PATH = /User/homes/xshan/.julia_sindbad_clean
  JULIA_LOAD_PATH = @:@stdlib
  JULIA_NUM_THREADS = 17

julia> Threads.nthreads()
17

Well, on the HPC cluster, it is not clear if other users are working on it at the same time. If you are the only one, the result is no surprise:

With one core, the CPU runs at 3.9 GHz, with all cores active at 2.3 GHz. In addition, execution time might be limited by the cache size or cache throughput.

For a large number of cores, it might be better to use multiple tasks and let each of them run a small number of threads.

To analyze this further, you could:

  • make a plot about performance vs thread-count
  • share the algorithm so that we could look if cache or memory throughput are the limiting factor

If you achieve 8 times the performance with 16 threads on 16 cores, that would already be very good.
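To illustrate the "multiple tasks" idea, here is a minimal sketch (the work function, the chunk count, and the toy result array are placeholders, not your model) that splits the site indices into chunks and runs each chunk in its own task:

```julia
using Base.Threads

# Hypothetical sketch: split n_items site indices into chunks and let each
# task process one chunk serially, instead of one @threads iteration per site.
function run_chunks(work, n_items; n_chunks = nthreads())
    chunks = Iterators.partition(1:n_items, cld(n_items, n_chunks))
    tasks = [@spawn foreach(work, idxs) for idxs in chunks]
    foreach(wait, tasks)
end

results = zeros(17)
run_chunks(i -> results[i] = i^2, 17)   # each index writes only its own slot
```

In your case, `work` would be the call to coreTEM! for one site, and `n_chunks` would be tuned to the number of physical cores you actually want to use.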

Thanks for your reply!
Regarding the algorithm, which algorithm are you referring to? I only have the pseudo-code script, which is already shared in the original post…

I am also wondering whether there is a better way to write the parallelization?

Some resource is clearly shared between the threads. It could be many things. I doubt it is the GC; it is not used much. It could be memory bandwidth, false sharing, or cache thrashing; it is hard to tell without knowing what is going on inside coreTEM!. There are tools for deeper analysis, like LinuxPerf.jl and LIKWID.jl.


I agree with @sgaure, and my best guess would be that there is false sharing. Is there a single small array that all the threads are writing to? Can you comment out the writing part and see if it helps? Or make all the writes thread-local temporarily?
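For background, here is a minimal toy demonstration of the pattern (the functions and sizes are made up, not your model). Adjacent counters in one small array sit on the same cache lines, so threads invalidate each other's caches even though they never touch the same element; spacing the counters a cache line apart avoids that:

```julia
using Base.Threads

# Each thread owns one counter, but the Int64 counters are adjacent in
# memory, so they share cache lines (false sharing).
function count_shared!(counts, n)
    @threads for t in 1:length(counts)
        for _ in 1:n
            counts[t] += 1
        end
    end
end

# Same work, but each counter sits in its own 64-byte "row" (8 Int64s),
# so no two threads write to the same cache line.
function count_padded!(counts, n)
    @threads for t in 1:size(counts, 2)
        for _ in 1:n
            counts[1, t] += 1
        end
    end
end

shared = zeros(Int, nthreads())
padded = zeros(Int, 8, nthreads())
@time count_shared!(shared, 10^7)
@time count_padded!(padded, 10^7)   # typically faster with several threads
```

If commenting out the writes in your real code makes the threaded run scale, a padded or fully thread-local output layout like this is one way to fix it.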

Another idea to try is pinthreads(:numa) from ThreadPinning.jl (GitHub - carstenbauer/ThreadPinning.jl: Readily pin Julia threads to CPU-threads) to prevent threads from migrating between CPU cores and to rule out competition for cache, although running on an HPC cluster with loads of cores already makes this issue less likely.

Sorry, I only just understood what you meant by the algorithm… here is what is inside coreTEM!:

function coreTEM!(selected_models, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, loc_land, tem_info, spinup_sequence)
    # update the loc_forcing with the actual location
    loc_forcing_t = getForcingForTimeStep(loc_forcing, loc_forcing_t, 1, tem_info.vals.forcing_types)
    # run precompute
    land_prec = precomputeTEM(selected_models, loc_forcing_t, loc_land, tem_info.model_helpers) 
    # run spinup
    land_spin = spinupTEM(selected_models, loc_spinup_forcing, loc_forcing_t, land_prec, tem_info, spinup_sequence, tem_info.run.spinup_TEM)

    timeLoopTEM!(selected_models, loc_forcing, loc_forcing_t, loc_output, land_spin, tem_info.vals.forcing_types, tem_info.model_helpers, tem_info.vals.output_vars, tem_info.n_timesteps, tem_info.run.debug_model)
    return nothing
end

function timeLoopTEM!(selected_models, loc_forcing, loc_forcing_t, loc_output, land, forcing_types, model_helpers, output_vars, n_timesteps, ::DoNotDebugModel) # do not debug the models
    # n_timesteps=1
    for ts ∈ 1:n_timesteps
        f_ts = getForcingForTimeStep(loc_forcing, loc_forcing_t, ts, forcing_types)
        land = computeTEM(selected_models, f_ts, land, model_helpers)
        setOutputForTimeStep!(loc_output, land, ts, output_vars)
    end
end

The most time-consuming function should be timeLoopTEM!, because it contains the inner for-loop, stepping hour by hour through the days, months, and years of the simulation period.

timeLoopTEM! takes loc_forcing as input and writes into loc_output. Both are elements of the vectors space_forcing and space_output, i.e. the i-th elements of space_forcing and space_output are assigned to the i-th iteration by @threads.

That does not seem to work… following ChatGPT's suggestion, I used:

using ThreadPinning
ThreadPinning.pinthreads(:affinitymask; nthreads=Threads.nthreads(), threadpool=:default)

at the beginning of my Julia outer script (the for loop using threads is within a module/package). Then the results of @time and @btime are:

julia> space_index = 3
3

julia> @time coreTEM!(space_selected_models, space_forcing[space_index], space_spinup_forcing[space_index], 
                       loc_forcing_t, space_output[space_index], space_land[space_index], tem_info, tem_info.space_spinup_sequence[space_index])
 23.099125 seconds (28 allocations: 31.656 KiB)

julia> @btime coreTEM!($space_selected_models, $space_forcing[$space_index], $space_spinup_forcing[$space_index], 
                       $loc_forcing_t, $space_output[$space_index], $space_land[$space_index], $tem_info, $tem_info.space_spinup_sequence[$space_index])

  22.832 s (30 allocations: 32.64 KiB)

julia> @time runTEM!(info.models.forward, run_helpers.space_forcing, run_helpers.space_spinup_forcing, run_helpers.loc_forcing_t, run_helpers.space_output, run_helpers.space_land, run_helpers.tem_info)
 66.328300 seconds (593 allocations: 692.938 KiB)

julia> @btime runTEM!($info.models.forward, $run_helpers.space_forcing, $run_helpers.space_spinup_forcing, $run_helpers.loc_forcing_t, $run_helpers.space_output, $run_helpers.space_land, $run_helpers.tem_info)
  63.320 s (586 allocations: 615.27 KiB)

That is about 3 times the single-threaded runtime…

I suggest you try commenting out writes to shared state and see if it helps

Thanks, but then how would I collect the results from the different loop iterations?

I forgot to mention the most important thing: the input data length differs between threads… for example, some sites cover 200 years, some 300 years… could that be an issue?

Just test whether commenting out the writes helps with the performance. If yes, then we can try to answer this question; if not, we do not need to think about it.
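Probably not the main issue here, but for later reference: if site lengths differ a lot (200 vs. 300 years), the default static split of @threads can leave some threads idle. One hedged option is to spawn one task per site so the scheduler balances the load (on Julia ≥ 1.11, Threads.@threads :greedy does something similar). A sketch with a stand-in workload, not your model:

```julia
using Base.Threads

work(i) = sum(sin, 1:i * 10_000)     # stand-in for coreTEM! on site i;
                                     # cost grows with i, like longer sites

results = Vector{Float64}(undef, 17)
@sync for i in 1:17
    @spawn results[i] = work(i)      # each site writes only its own slot
end
```

Because each task writes only to its own slot of results, no synchronization beyond @sync is needed, and faster sites finish early without blocking slower ones.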

Do these functions read data from the disc?