Intriguing performance comparison results.... threads, comprehensions, etc

Can anyone give me some intuition for the poor performance of the dynamic-threads for loop in the example below? (All times are from the second run.) Thanks!

using LinearAlgebra, Base.Threads


function test( Y, ::Val{1} )
    X = similar( Y )
    for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end


function test( Y, ::Val{2} )
    return inv.( Y )
end



function test( Y, ::Val{3} )
    X = similar( Y )
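    # :static schedule: the iteration range is split into one fixed, contiguous chunk per thread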
    @threads :static for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end


function test( Y, ::Val{4} )
    X = similar( Y )
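    # :dynamic schedule (the default since Julia 1.8): tasks may migrate between threads, which adds some scheduling overhead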
    @threads :dynamic for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end


function test( Y, ::Val{5} )
    return map( inv, Y )
end

function test( Y, ::Val{6} )
    return [ inv(y) for y ∈ Y ]
end


generate( N, n ) = [ randn( n, n ) for i ∈ 1:N ]


function runme( N, n )
    X = generate( N, n )
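    # first pass: trigger compilation of each method; the labelled second pass below gives the timings of interest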
    for i ∈ 1:6
        @time test( X, Val( i ) )
    end
    for i ∈ 1:6
        print( "$i:   " )
        @time test( X, Val( i ) )
    end

end

How do you conclude that your results show poor performance? What is your reference/expected result?

What was Threads.nthreads()?

Method 4 takes more than twice as long as the other five methods.

Threads.nthreads() was the number of hyperthreads (the difference is smaller when using the number of physical cores, which is what I tend to use in practice).

Some things to note:

  • inv() calls out to BLAS and is already threaded, so the threaded options here only add extra overhead (a sketch for checking this follows this list).
  • Dynamic scheduling has extra overhead. In cases where n is small and N is especially large, this overhead is proportionately larger.
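One quick way to separate BLAS's internal threading from the @threads overhead: pin BLAS to a single thread and rerun the comparison. This is only a minimal sketch, assuming the generate/test definitions above are already loaded; the problem size is arbitrary.

using LinearAlgebra

nblas = BLAS.get_num_threads()      # remember the current BLAS setting
BLAS.set_num_threads( 1 )           # BLAS/LAPACK now runs single-threaded

X = generate( 1000, 100 )           # arbitrary problem size, just for illustration
for i ∈ 1:6
    print( "$i:   " )
    @time test( X, Val( i ) )       # any remaining gap between methods is Julia-level threading overhead
end

BLAS.set_num_threads( nblas )       # restore the original setting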

In my own tests with (N=100, n=1000), (N=1000, n=1000), (N=1000000, n=100) I see basically all methods giving approximately the same runtime. What N, n were you using?


As torrance said, the BLAS function itself is threaded. If you use multiple Julia threads, the interaction between Julia threads and BLAS threads is complicated. If the total number of threads exceeds your CPU core count, you will see poor performance.
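A rough way to check whether that can happen on a given machine is sketched below; the product is only an upper bound, since BLAS keeps its own thread pool rather than spawning new threads inside each call.

using LinearAlgebra    # for BLAS.get_num_threads

jl      = Threads.nthreads()        # Julia threads (set with -t / JULIA_NUM_THREADS)
blas    = BLAS.get_num_threads()    # threads each BLAS/LAPACK call may use
logical = Sys.CPU_THREADS           # logical cores (hyperthreads)

println( "Julia: $jl, BLAS: $blas, logical cores: $logical" )
jl * blas > logical && println( "possible oversubscription: up to $(jl * blas) busy threads" )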

This document clearly explains such an issue:

This run is for N = 1000, n = 1000.

Thanks @karei. Indeed. I’m just surprised by the chasm between dynamic threads and the other methods. I’m familiar with the ThreadPinning package, though in my application I haven’t seen a significant performance increase from it.

The reason I’m looking into this is a performance degradation in my package from one Julia version to the next, and the fact that reducing the number of cores now often improves runtime.

Fewer cores → less cache pressure → better performance
Fewer threads → less thread-switching overhead → better performance

A high number of threads is only useful if:

  • your code does not allocate much (low GC overhead; see the illustration after this list)
  • the execution time within each thread is high
  • the memory needed for each thread is low
  • your CPU has lots of cache and/or a very high memory bandwidth
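To see how much the allocation point matters for this benchmark, the allocation and GC figures that @time already prints are enough. A small illustration with arbitrary sizes:

using LinearAlgebra

Y = [ randn( 100, 100 ) for i ∈ 1:10_000 ]
# every inv(y) allocates at least one fresh n×n result matrix plus LAPACK
# workspace, so the allocation total reported by @time grows with N and n
@time map( inv, Y )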

Thanks @ufechner7. I understand all of that, but I’m still surprised by the 2.5× performance cliff that shows up only with dynamic threads. I expected there to be a cost, but I didn’t realize creating/using extra threads was that costly.

If you need cheap threads, use GitHub - JuliaSIMD/Polyester.jl: The cheapest threads you can find!
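For reference, the drop-in change would look roughly like this. This is an untested sketch: @batch is Polyester’s low-overhead replacement for @threads, and test_batch is just a made-up name here.

using LinearAlgebra, Polyester

function test_batch( Y )
    X = similar( Y )
    @batch for n ∈ eachindex( X )    # @batch reuses Polyester's lightweight task pool instead of spawning new tasks
        X[n] = inv( Y[n] )
    end
    return X
end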

An additional aspect is that the CPU clock speed goes down the more CPU cores you use… This effect differs a lot from CPU to CPU, though.
