Can anyone provide me with intuition for the poor performance of the dynamic threads for loop in the example below? (all second run times) Thanks!
using LinearAlgebra, .Threads

function test( Y, ::Val{1} )
    X = similar( Y )
    for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end

function test( Y, ::Val{2} )
    return inv.( Y )
end

function test( Y, ::Val{3} )
    X = similar( Y )
    @threads :static for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end

function test( Y, ::Val{4} )
    X = similar( Y )
    @threads :dynamic for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end

function test( Y, ::Val{5} )
    return map( inv, Y )
end

function test( Y, ::Val{6} )
    return [ inv(y) for y ∈ Y ]
end

generate( N, n ) = [ randn( n, n ) for i ∈ 1:N ]

function runme( N, n )
    X = generate( N, n )
    for i ∈ 1:6
        @time test( X, Val( i ) )
    end
    for i ∈ 1:6
        print( "$i: " )
        @time test( X, Val( i ) )
    end
end
`inv` on a matrix calls out to BLAS/LAPACK, which is already multithreaded. So the threaded options here only add extra overhead on top of that.
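One way to check this (a sketch, not from the original post) is to pin BLAS to a single thread before running the benchmark, so that Julia-level `@threads` is the only source of parallelism and isn't competing with BLAS for cores:

```julia
using LinearAlgebra

# Give all cores to Julia's @threads by disabling BLAS's own threading.
BLAS.set_num_threads(1)

# ... run runme(N, n) here and compare timings ...

# Restore BLAS threading afterwards (one thread per physical core is typical).
BLAS.set_num_threads(Sys.CPU_THREADS)
```

With BLAS pinned to one thread, the `@threads` variants should scale much more predictably, since each `inv` call then runs serially inside its own Julia task.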
Dynamic scheduling has extra overhead. When `n` is small and `N` is especially large, that overhead becomes proportionately larger, because the per-iteration work is tiny relative to the scheduling cost.
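If the per-iteration cost really is the issue, one standard workaround (a hypothetical sketch, not part of the original benchmark) is to amortize scheduling overhead by spawning one task per chunk of iterations instead of letting the scheduler handle each index:

```julia
using LinearAlgebra
using Base.Threads

# Chunked variant of the element-wise inv: one task per chunk of indices,
# so the dynamic scheduler is invoked nthreads() times rather than N times.
function test_chunked(Y)
    X = similar(Y)
    chunksize = cld(length(X), nthreads())
    tasks = [Threads.@spawn begin
                 for n in chunk
                     X[n] = inv(Y[n])
                 end
             end
             for chunk in Iterators.partition(eachindex(X), chunksize)]
    foreach(wait, tasks)
    return X
end
```

If this variant closes the gap with `:static`, the cliff is scheduling overhead; if not, the cause is more likely thread oversubscription with BLAS.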
In my own tests with (N=100, n=1000), (N=1000, n=1000), and (N=1000000, n=100), all six tests gave approximately the same runtime. What N, n were you using?
As torrance said, the BLAS function itself is threaded. If you also use multiple Julia threads, the interaction between Julia threads and BLAS threads is complicated, and if the total number of threads exceeds your CPU core count, you will see poor performance from oversubscription.
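A minimal way to avoid oversubscription (a sketch under the assumption that each Julia thread drives its own BLAS calls) is to divide the physical cores between the two thread pools so their product doesn't exceed the core count:

```julia
using LinearAlgebra
using Base.Threads

# Split the cores: if Julia was started with J threads, give BLAS at most
# CPU_THREADS ÷ J threads so the total stays within the machine.
j = nthreads()
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ j))

@show j BLAS.get_num_threads()
```

With, say, 8 cores and `julia -t 4`, this sets BLAS to 2 threads, keeping the total at 8 instead of the 32 you would get with BLAS at its default.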
Thanks @karei. Indeed. I'm just surprised by the chasm between dynamic threads and the other methods. I'm familiar with the ThreadPinning package, though in my application I haven't found a significant performance increase from it.
The reason I'm looking into this is a performance degradation in my package from one Julia version to the next, and the fact that reducing the number of cores now often improves runtime.
Thanks @ufechner7. I understand all of that, but I'm still surprised by the 2.5× performance cliff that only shows up with dynamic threads. I expected a cost, but didn't realize that creating and using extra threads was that expensive.
An additional aspect is that the CPU clock frequency drops as more CPU cores are in use… though this effect varies a lot from one CPU to another.