Losing performance with `Threads.@threads` for loop

I am seeing strange performance issues when I use Threads.@threads to parallelize a for loop. My machine has 40 cores, and Julia is running with 40 threads:

julia> Threads.nthreads()
40

Now, I lose performance even with a quite simple example:

using LinearAlgebra, SharedArrays

N = 500
m = 100
A = randn(m, N, N) |> SharedArray

@time Threads.@threads for i ∈ 1:m
    det(A[i, :, :])
end

takes about 12-13 seconds and allocates 382.813 MiB.

This should be compared to

@time for i ∈ 1:m
    det(A[i, :, :])
end

which runs in a little under 0.5 seconds and allocates 381.693 MiB.

The difference in allocated memory is primarily due to spawning the threads, as

@time Threads.@threads for i ∈ 1:m
    1+1
end

allocates 1010.651 KiB, i.e., approximately the difference.

Am I doing something very obvious very wrong, or should I throw my server out of the window? :smiley:

Julia version is 1.5.3

Looks fine for me on 1.6.3

using LinearAlgebra
using SharedArrays
using BenchmarkTools

N = 50
m = 10
A = randn(m, N, N) |> SharedArray

@btime Threads.@threads for i ∈ 1:m
    det(A[i, :, :])
end

@btime for i ∈ 1:m
    det(A[i, :, :])
end

with

  1.079 ms (122 allocations: 401.75 KiB)
  5.232 ms (101 allocations: 398.94 KiB)

But for the original sizes I get:

  9.166 s (933 allocations: 381.89 MiB)
  901.518 ms (1001 allocations: 381.89 MiB)

So maybe the problem is due to oversaturation of the cores?

Looks like it:

using LinearAlgebra
using SharedArrays
using BenchmarkTools

function test1(tc)
    N = 500
    m = 100
    A = randn(m, N, N) |> SharedArray

    for i in 1:tc:m-tc
        Threads.@threads for j ∈ i:i+tc-1
            det(A[j, :, :])
        end
    end
end

function test2()
    N = 500
    m = 100
    A = randn(m, N, N) |> SharedArray

    for i ∈ 1:m
        det(A[i, :, :])
    end
end

@btime test1(1)
@btime test1(2)
@btime test1(5)
@btime test1(10)
@btime test2()

shows

  1.265 s (3643 allocations: 569.10 MiB)
  679.938 ms (2084 allocations: 565.13 MiB)
  597.149 ms (1149 allocations: 553.58 MiB)
  7.581 s (809 allocations: 534.46 MiB)
  1.256 s (575 allocations: 572.62 MiB)

with 6 cores.

You don't need a SharedArray, because threads already share memory.

Also, your original post suffers from the fact that A is a non-constant global variable, which the compiler can't optimize.
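
For illustration, here is a minimal sketch of both points (the name dets_threaded is just a placeholder, and the small sizes are only for quick testing): pass a plain Array into a function, and interpolate it into @btime with $:

using LinearAlgebra
using BenchmarkTools

# A plain Array is enough: Julia threads all see the same memory.
A = randn(10, 50, 50)

# Passing the array as an argument lets the compiler specialize the loop
# body on its concrete type instead of a non-constant global.
function dets_threaded(A)
    m = size(A, 1)
    Threads.@threads for i in 1:m
        det(A[i, :, :])
    end
end

# `$A` interpolates the global so BenchmarkTools benchmarks the call itself.
@btime dets_threaded($A)

# Alternatively, `const A = randn(10, 50, 50)` also removes the type instability.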

Thanks Jerry,

updated code and numbers:

using LinearAlgebra
using BenchmarkTools

function test1(tc)
    N = 500
    m = 100
    A = randn(m, N, N)

    for i in 1:tc:m-tc
        Threads.@threads for j ∈ i:i+tc-1
            det(A[j, :, :])
        end
    end
end

function test2()
    N = 500
    m = 100
    A = randn(m, N, N) 

    for i ∈ 1:m
        det(A[i, :, :])
    end
end

Running the same @btime calls as before gives:

  952.835 ms (3566 allocations: 569.10 MiB)
  566.878 ms (2014 allocations: 565.13 MiB)
  467.109 ms (1077 allocations: 553.58 MiB)
  5.432 s (734 allocations: 534.46 MiB)
  954.884 ms (502 allocations: 572.62 MiB)

Thanks for the tests! I fear something must be severely wrong, either with the server or with my Julia environment:

julia> @btime test1(1)
  3.523 s (20467 allocations: 571.43 MiB)
julia> @btime test1(2)
  12.826 s (10615 allocations: 566.19 MiB)
julia> @btime test1(5)
  19.033 s (4687 allocations: 553.89 MiB)
julia> @btime test1(10)
  11.294 s (2490 allocations: 534.52 MiB)
julia> @btime test1(40)
  9.206 s (889 allocations: 496.15 MiB)
julia> @btime test2()
  694.097 ms (502 allocations: 572.43 MiB)

Global scope cannot be the problem: I used @goerch's examples for this, and in my production code, where I observed these problems, everything is also nicely packed into functions.

Hm. More suspects: either your machine is busy or this is due to differences between 1.5.3 and 1.6.3.

Isn't the problem that det uses BLAS, which is already threaded?

@Oscar: negative. I observed the problem with multiple tests I did today. det was just something I used for the MWE since it is non-allocating and numerically "expensive", i.e., it provides a nice test case.
(Allocating examples had the obvious problem that 40 cores fight for memory and are thus inherently slow.)


I have now rerun the tests, both on my local machine (v1.4.1) and on the remote machine with v1.4.2. In both cases I observe similar scaling to @goerch's. I tend to believe that the 1.5.3 installation on the remote machine must be broken.

Thanks for everyone's help. I was starting to think I was going crazy.

Try BLAS.set_num_threads(1). OpenBLAS's performance on LU factorizations (which det uses) degrades rapidly with additional cores: using multiple BLAS threads burns more CPU while also making the factorization take longer.

Anyway, if you want to improve performance, try

using LinearAlgebra
using BenchmarkTools
BLAS.set_num_threads(1)

function test1()
    N = 500
    m = 100
    A = randn(N, N, m)

    Threads.@threads for i ∈ 1:m
        det(@view(A[:, :, i]))
    end
end

function test2()
    N = 500
    m = 100
    A = randn(N, N, m) 

    for i ∈ 1:m
        det(@view(A[:, :, i]))
    end
end

Here are my numbers with BLAS.set_num_threads(1), rerunning the earlier chunked test1(tc) and test2():

@btime test1(1)
@btime test1(2)
@btime test1(5)
@btime test1(10)
@btime test1(20)
@btime test2()

  1.139 s (3566 allocations: 569.10 MiB)
  614.184 ms (2029 allocations: 565.13 MiB)
  308.655 ms (1080 allocations: 553.58 MiB)
  302.716 ms (738 allocations: 534.46 MiB)
  279.247 ms (526 allocations: 496.25 MiB)
  1.141 s (502 allocations: 572.62 MiB)

Scales way better!
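
For completeness, here is a minimal sketch of how one might restrict BLAS to a single thread only around the threaded section and restore the previous setting afterwards (this assumes Julia 1.6 or newer, where BLAS.get_num_threads is available):

using LinearAlgebra

A = randn(50, 50, 10)

# Remember the current BLAS thread count so it can be restored afterwards.
nblas = BLAS.get_num_threads()

# One BLAS thread per Julia thread avoids oversubscribing the cores.
BLAS.set_num_threads(1)
try
    Threads.@threads for i in 1:size(A, 3)
        det(@view(A[:, :, i]))
    end
finally
    # Restore the original BLAS thread count.
    BLAS.set_num_threads(nblas)
end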