Threads.@threads does not work properly

Hi everyone,

for the first time I am trying to exploit multi-threading in Julia. However, it seems I am doing some rookie mistake, since it does not work. I am running this very simple code (not a MWE)

t = 0.0
ℓ = 1000
ψ = randn(6, ℓ)

f = zeros(6, ℓ)

Threads.@threads for j in 1:ℓ
      f[:, j] = computeFreeDynamics(ψ[:, j], p, t)

where p is a structure of parameters and computeFreeDynamics is a “slow” function computing the free dynamics of a system.
I benchmarked the code with and without Threads.@threads and there is basically no change. However, if I try to use the same computeFreeDynamics function as RHS of an ODEProblem to run a MonteCarlo, Threads.@threads works as expected.
Do you have any idea what could be the problem?

make sure you start julia itself with multiple threads, e.g. like julia -t8 or julia --threads=auto

Yes, Julia is started with julia -t8 and, in fact, Threads.nthreads() gives 8 as output

Is computeFreeDynamics doing lots of matrix multiplications? if so, you will likely want to BLAS.set_num_threads(1) since otherwise all your cores will be used for the matrix multiplication, removing the ability to scale.

This allocates a new array for every ψ[:, j] and should be a @view. Moreover, you likely want to modify f[:, j] in place in computeFreeDynamics.

1 Like

In general, there is no guarantee that multithreading will speed up your code by the number of threads, or even speed it up at all. The best way to improve performance is by focusing on the sequential implementation, ensuring type-stability and in particular reducing allocations. Can you share more details about computeFreeDynamics?

1 Like

There’s little to go on in your example. As noted, if you’re doing some heavy lifting with linear algebra, it’s probably already parallelized.

There’s nothing wrong with the loop as such, except the allocations noted above. I.e. ψ[:, j] allocates a vector, and computeFreeDynamics also allocates a vector for return. Allocations in parallel programs can be a performance problem. But it’s possible to write fortran in any language, and make a pre allocated version computeFreeDynamics! so that the loop becomes

@threads for j in 1:ℓ
    computeFreeDynamics!(@view(f[:, j]), @view(ψ[:, j]), p, t)

You can of course keep an allocating version around as something like

computeFreeDynamics(args...) = computeFreeDynamics!(zeros(6), args...)