@threads for loop performance

Hi all,

Why does my code run slower when I use more threads in a @threads for loop?

I would like to know whether this is standard behavior for threaded parallelization.

Here is sample code that reproduces the behavior.

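function check(n)
    # n identity matrices and n zero matrices, each 2x2 complex
    y = [one(Array{ComplexF64}(undef, 2, 2)) for i = 1:n]
    t = [zero(Array{ComplexF64}(undef, 2, 2)) for i = 1:n]

    Threads.@threads for i = 1:n
        for j = 1:n
            for k = 1:n
                t[i] += y[k]  # builds a new 2x2 matrix on every iteration
            end
        end
    end
end

@time check(400)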


Can you post a minimal working example of the code? Otherwise, it’s hard to help.


I did not post it because it is quite big. However, I can explain it here. Overall, what I am doing is the following: I have a 1D array (length 10000) of 2x2 complex matrices, and I apply some functions to these matrices using a threaded for loop in which each iteration handles one element of the 1D array.

How many physical cores does your computer have?

I’d also use SMatrix from StaticArrays.jl for 2x2 complex matrices.
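To illustrate the SMatrix suggestion, here is a minimal sketch that assumes the same loop structure as the check function above. The name check_static is made up for this example, and StaticArrays.jl is an external package that has to be installed first; I have not timed it here.

```julia
using StaticArrays  # external package, install with: ] add StaticArrays

function check_static(n)
    M = SMatrix{2,2,ComplexF64,4}       # fixed-size, immutable 2x2 complex matrix
    y = [one(M) for _ in 1:n]           # n identity matrices
    t = [zero(M) for _ in 1:n]          # n zero matrices

    Threads.@threads for i in 1:n
        acc = t[i]                      # local accumulator for this iteration
        for j in 1:n, k in 1:n
            acc += y[k]                 # creates a new SMatrix value, no heap allocation
        end
        t[i] = acc                      # write the result back once
    end
    return t
end

@time check_static(400)
```

Because an SMatrix is immutable, the sum is accumulated in a local variable and stored back into t[i] once at the end instead of being updated element by element.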


I have 28 cores. I will try static arrays.

In your case, the problem seems to be related to your use of an Array of 2D matrices, which leads to a lot of allocations: t[i] += y[k] creates a new matrix on every iteration. Using .+= reduces allocations by doing the update in place. Using 3D arrays as follows

    y = ones(ComplexF64,(2,2,n))
    t = zeros(ComplexF64,(2,2,n))

is 10x faster on my computer and I get an additional speed-up from multithreading.
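As a rough sketch of both suggestions, assuming the loop structure of the original check function: the names check_inplace and check_3d are made up for this example, and the speed-up you see will depend on your machine and thread count.

```julia
# Variant 1: keep the vector of 2x2 matrices, but update in place with .+=
function check_inplace(n)
    y = [ComplexF64[1 0; 0 1] for _ in 1:n]      # n identity matrices
    t = [zeros(ComplexF64, 2, 2) for _ in 1:n]   # n zero matrices
    Threads.@threads for i in 1:n
        ti = t[i]
        for j in 1:n, k in 1:n
            ti .+= y[k]        # in-place broadcast, no new matrix is created
        end
    end
    return t
end

# Variant 2: store everything in 3D arrays, as in the snippet above
function check_3d(n)
    y = ones(ComplexF64, (2, 2, n))
    t = zeros(ComplexF64, (2, 2, n))
    Threads.@threads for i in 1:n
        ti = view(t, :, :, i)                 # view of the i-th 2x2 slice, no copy
        for j in 1:n, k in 1:n
            ti .+= view(y, :, :, k)           # in-place update of that slice
        end
    end
    return t
end

@time check_inplace(400)
@time check_3d(400)
```

Note that ones(ComplexF64, (2,2,n)) fills every entry with 1, whereas the original code used identity matrices, so the two variants do not produce the same numbers.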

Interestingly, I also found recently that the performance deteriorated when I added Threads.@threads to my main for loop. In my case, the solution was to switch off multithreading in BLAS by calling BLAS.set_num_threads(1). From that point on, I started to see a speed-up.
It would be useful if we could at least get a warning in such cases.
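For reference, a minimal sketch of that setting, assuming the threaded loop calls BLAS-backed operations such as dense matrix multiplication. The function name threaded_products! and the matrix sizes are made up for illustration.

```julia
using LinearAlgebra

# Keep BLAS single-threaded so its threads do not compete with Julia's threads
BLAS.set_num_threads(1)

function threaded_products!(C, A, B)
    Threads.@threads for i in eachindex(C)
        mul!(C[i], A[i], B[i])   # BLAS matrix product, written into C[i]
    end
    return C
end

A = [rand(50, 50) for _ in 1:8]
B = [rand(50, 50) for _ in 1:8]
C = [zeros(50, 50) for _ in 1:8]
threaded_products!(C, A, B)
```

Whether this helps depends on how much of the loop body is spent inside BLAS, so it is worth timing both settings.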




Okay, I will try. Thanks for the help.