Intriguing performance comparison results.... threads, comprehensions, etc

Can anyone give me some intuition for the poor performance of the dynamic-threads for loop in the example below? (All times are from the second run.) Thanks!

using LinearAlgebra, Base.Threads


function test( Y, ::Val{1} )
    X = similar( Y )
    for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end


function test( Y, ::Val{2} )
    return inv.( Y )
end



function test( Y, ::Val{3} )
    X = similar( Y )
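    # :static schedule: the iteration range is split into one fixed, contiguous chunk per thread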
    @threads :static for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end


function test( Y, ::Val{4} )
    X = similar( Y )
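    # :dynamic schedule (the default since Julia 1.8): tasks may migrate between threads, which adds some scheduling overhead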
    @threads :dynamic for n ∈ eachindex( X )
        X[n] = inv( Y[n] )
    end
    return X
end


function test( Y, ::Val{5} )
    return map( inv, Y )
end

function test( Y, ::Val{6} )
    return [ inv(y) for y ∈ Y ]
end


generate( N, n ) = [ randn( n, n ) for i ∈ 1:N ]


function runme( N, n )
    X = generate( N, n )
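    # first pass: trigger compilation of each method; the labelled second pass below gives the timings of interest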
    for i ∈ 1:6
        @time test( X, Val( i ) )
    end
    for i ∈ 1:6
        print( "$i:   " )
        @time test( X, Val( i ) )
    end

end

How do you conclude that your results show poor performance? What is your reference/expected result?

What was Threads.nthreads()?

Method 4 takes more than twice as long as the other five methods.

Threads.nthreads() was the number of hyperthreads (the difference is smaller when using the number of physical cores, which is what I tend to use in practice).

Some things to note:

  • inv() calls out to BLAS and is already threaded, so the threaded options here only add extra overhead (a sketch for checking this follows this list).
  • Dynamic scheduling has extra overhead. In cases where n is small and N is especially large, this overhead is proportionately larger.
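One quick way to separate BLAS's internal threading from the @threads overhead: pin BLAS to a single thread and rerun the comparison. This is only a minimal sketch, assuming the generate/test definitions above are already loaded; the problem size is arbitrary.

using LinearAlgebra

nblas = BLAS.get_num_threads()      # remember the current BLAS setting
BLAS.set_num_threads( 1 )           # BLAS/LAPACK now runs single-threaded

X = generate( 1000, 100 )           # arbitrary problem size, just for illustration
for i ∈ 1:6
    print( "$i:   " )
    @time test( X, Val( i ) )       # any remaining gap between methods is Julia-level threading overhead
end

BLAS.set_num_threads( nblas )       # restore the original setting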

In my own tests with (N=100, n=1000), (N=1000, n=1000), (N=1000000, n=100) I see basically all methods giving approximately the same runtime. What N, n were you using?


As torrance said, the BLAS function itself is threaded. If you use multiple Julia threads, the interaction between Julia threads and BLAS threads is complicated. If the total number of threads exceeds your CPU core count, you will see poor performance.
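A rough way to check whether that can happen on a given machine is sketched below; the product is only an upper bound, since BLAS keeps its own thread pool rather than spawning new threads inside each call.

using LinearAlgebra    # for BLAS.get_num_threads

jl      = Threads.nthreads()        # Julia threads (set with -t / JULIA_NUM_THREADS)
blas    = BLAS.get_num_threads()    # threads each BLAS/LAPACK call may use
logical = Sys.CPU_THREADS           # logical cores (hyperthreads)

println( "Julia: $jl, BLAS: $blas, logical cores: $logical" )
jl * blas > logical && println( "possible oversubscription: up to $(jl * blas) busy threads" )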

This document clearly explains such an issue:

This run is for N = 1000, n = 1000.

Thanks @karei. Indeed. I’m just surprised by the chasm between dynamic threads and the other methods. I’m familiar with the ThreadPinning package, though in my application I haven’t seen a significant performance increase from it.

The reason I’m looking into this is a performance degradation in my package from one Julia version to the next, and the fact that reducing the number of cores now often improves runtime.

Fewer cores → less cache pressure → better performance
Fewer threads → less thread-switching overhead → better performance

A high number of threads is only useful if:

  • your code does not allocate much (low GC overhead; see the illustration after this list)
  • the execution time within each thread is high
  • the memory needed for each thread is low
  • your CPU has lots of cache and/or a very high memory bandwidth
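To see how much the allocation point matters for this benchmark, the allocation and GC figures that @time already prints are enough. A small illustration with arbitrary sizes:

using LinearAlgebra

Y = [ randn( 100, 100 ) for i ∈ 1:10_000 ]
# every inv(y) allocates at least one fresh n×n result matrix plus LAPACK
# workspace, so the allocation total reported by @time grows with N and n
@time map( inv, Y )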

Thanks @ufechner7. I understand all of that, but I’m still surprised by the 2.5× performance cliff that shows up only with dynamic threads. I expected there to be a cost, but I didn’t realize creating/using extra threads was that costly.

If you need cheap threads, use GitHub - JuliaSIMD/Polyester.jl: The cheapest threads you can find!
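For reference, the drop-in change would look roughly like this. This is an untested sketch: @batch is Polyester’s low-overhead replacement for @threads, and test_batch is just a made-up name here.

using LinearAlgebra, Polyester

function test_batch( Y )
    X = similar( Y )
    @batch for n ∈ eachindex( X )    # @batch reuses Polyester's lightweight task pool instead of spawning new tasks
        X[n] = inv( Y[n] )
    end
    return X
end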

An additional aspect is that the CPU clock speed goes down the more CPU cores you use… This effect differs a lot from CPU to CPU, though.
