Why does my code run slower when I use more threads in a @threads for loop?

I would like to know whether this is standard behavior for threaded parallelization.

Here is a sample code that reproduces the behavior:

function check(n)
    y = [one(Array{ComplexF64}(undef, 2, 2)) for i = 1:n]
    t = [zero(Array{ComplexF64}(undef, 2, 2)) for i = 1:n]
    Threads.@threads for i = 1:n
        for j = 1:n
            for k = 1:n
                t[i] += y[k]
            end
        end
    end
end
@time check(400)

I did not post it because it is quite big, but I can explain it here. Overall, what I am doing is the following: I have a 1D array (length 10000) of 2x2 complex matrices, and I apply some functions to these matrices using a threaded for loop, where each thread takes one element of the 1D array.

In your case, the problem seems to be related to your use of an array of 2x2 matrices: `t[i] += y[k]` allocates a brand-new matrix on every iteration, which leads to a lot of allocations. Using `.+=` reduces allocations by doing the update in place. Using a 3D array as follows

y = ones(ComplexF64,(2,2,n))
t = zeros(ComplexF64,(2,2,n))

is 10x faster on my computer and I get an additional speed-up from multithreading.
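Putting the two suggestions together, a minimal sketch of the rewritten function might look like this (the function name `check_inplace` and the `view`-based indexing are my additions, not from the original post; note that `ones` fills every entry with 1, whereas the original `one(...)` produced identity matrices):

```julia
function check_inplace(n)
    # One 2x2 block per index i, stored contiguously in a single 3D array
    # instead of an array of separately allocated 2x2 matrices.
    y = ones(ComplexF64, 2, 2, n)
    t = zeros(ComplexF64, 2, 2, n)
    Threads.@threads for i = 1:n
        ti = view(t, :, :, i)           # a window into t, no copy
        for j = 1:n, k = 1:n
            ti .+= view(y, :, :, k)     # in-place broadcast, no allocation
        end
    end
    return t
end
```

Each 2x2 slice accumulates the all-ones matrix n^2 times, so after `check_inplace(n)` every entry of `t` equals `n^2`.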

Interestingly, I also recently found that performance deteriorated when I added Threads.@threads to my main for loop. In my case, the solution was to switch off multithreading in BLAS by calling BLAS.set_num_threads(1). From that point on, I started to see a speed-up.
It would be useful if we could at least get a warning in such cases.
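A minimal sketch of that fix, assuming the per-element work goes through BLAS-backed operations from LinearAlgebra (the loop body here is a stand-in for whatever the real computation is):

```julia
using LinearAlgebra

# Restrict BLAS to a single thread so it does not compete with Julia's
# own threads: if each @threads iteration also spawns BLAS threads, the
# cores get oversubscribed and the threaded loop can end up slower.
BLAS.set_num_threads(1)

n = 100
A = [rand(ComplexF64, 2, 2) for _ in 1:n]
out = Vector{Matrix{ComplexF64}}(undef, n)
Threads.@threads for i = 1:n
    out[i] = A[i] * A[i]   # placeholder for the real per-element work
end
```

For very small matrices like 2x2, Julia's multiply may not dispatch to BLAS at all, so the setting matters most when the per-element operations involve larger matrices; `BLAS.get_num_threads()` can be used to confirm the current setting.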