using LinearAlgebra
using SharedArrays
using BenchmarkTools
N = 50
m = 10
A = randn(m, N, N) |> SharedArray
@btime Threads.@threads for i ∈ 1:m
    det(A[i, :, :])
end
@btime for i ∈ 1:m
    det(A[i, :, :])
end
which gives
1.079 ms (122 allocations: 401.75 KiB)
5.232 ms (101 allocations: 398.94 KiB)
But for the original sizes I get:
9.166 s (933 allocations: 381.89 MiB)
901.518 ms (1001 allocations: 381.89 MiB)
So maybe the problem is due to oversubscription of the cores?
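One quick thing to check is whether the Julia threads and OpenBLAS's own threads together oversubscribe the machine. A minimal sanity check (only standard calls, nothing specific to this example):

using LinearAlgebra

Threads.nthreads()      # Julia threads (set via JULIA_NUM_THREADS)
Sys.CPU_THREADS         # logical CPU threads visible to Julia
# If Julia threads × BLAS threads exceeds the physical cores, the threaded
# det calls end up fighting for cores; pinning BLAS to one thread avoids that:
BLAS.set_num_threads(1)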
using LinearAlgebra
using SharedArrays
using BenchmarkTools
function test1(tc)
    N = 500
    m = 100
    A = randn(m, N, N) |> SharedArray
    for i in 1:tc:m-tc              # note: this range stops at m-tc, so the last tc rows are never processed
        Threads.@threads for j ∈ i:i+tc-1
            det(A[j, :, :])         # A[j, :, :] copies the slice (first-dimension slices are non-contiguous)
        end
    end
end
function test2()
    N = 500
    m = 100
    A = randn(m, N, N) |> SharedArray
    for i ∈ 1:m
        det(A[i, :, :])
    end
end
@btime test1(1)
@btime test1(2)
@btime test1(5)
@btime test1(10)
@btime test2()
shows
1.265 s (3643 allocations: 569.10 MiB)
679.938 ms (2084 allocations: 565.13 MiB)
597.149 ms (1149 allocations: 553.58 MiB)
7.581 s (809 allocations: 534.46 MiB)
1.256 s (575 allocations: 572.62 MiB)
using LinearAlgebra
using BenchmarkTools
function test1(tc)
    N = 500
    m = 100
    A = randn(m, N, N)
    for i in 1:tc:m-tc
        Threads.@threads for j ∈ i:i+tc-1
            det(A[j, :, :])
        end
    end
end
function test2()
    N = 500
    m = 100
    A = randn(m, N, N)
    for i ∈ 1:m
        det(A[i, :, :])
    end
end
952.835 ms (3566 allocations: 569.10 MiB)
566.878 ms (2014 allocations: 565.13 MiB)
467.109 ms (1077 allocations: 553.58 MiB)
5.432 s (734 allocations: 534.46 MiB)
954.884 ms (502 allocations: 572.62 MiB)
Thanks for the tests! I fear something must be severely wrong either with the server or with my Julia environment:
julia> @btime test1(1)
  3.523 s (20467 allocations: 571.43 MiB)
julia> @btime test1(2)
  12.826 s (10615 allocations: 566.19 MiB)
julia> @btime test1(5)
  19.033 s (4687 allocations: 553.89 MiB)
julia> @btime test1(10)
  11.294 s (2490 allocations: 534.52 MiB)
julia> @btime test1(40)
  9.206 s (889 allocations: 496.15 MiB)
julia> @btime test2()
  694.097 ms (502 allocations: 572.43 MiB)
Global scope cannot be the problem: I used @goerch's examples for this, and in my production code, where I observed these problems, everything is also nicely packed into functions.
@Oscar: negative. I could observe the problem with multiple tests I did today. det was just something I used for the MWE since it is non-allocating and numerically “expensive”, i.e., it provides a nice test case.
(Allocating examples had the obvious problem that 40 cores fighting for memory are inherently slow.)
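On the allocation point, a minimal sketch (my own, not one of the tests above) of keeping the per-iteration copies out of a threaded loop by giving each thread a preallocated scratch matrix and factorizing in place with lu!; test1_buffered and the buffer names are hypothetical:

using LinearAlgebra

function test1_buffered()
    N = 500
    m = 100
    A = randn(N, N, m)
    # one reusable N×N scratch matrix per thread
    bufs = [Matrix{Float64}(undef, N, N) for _ in 1:Threads.nthreads()]
    dets = Vector{Float64}(undef, m)
    Threads.@threads for i in 1:m
        buf = bufs[Threads.threadid()]
        copyto!(buf, @view A[:, :, i])  # copy the slice into the thread's scratch matrix
        dets[i] = det(lu!(buf))         # in-place LU; det of an LU factorization is cheap
    end
    dets
end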
I have now rerun the tests both on my local machine (v1.4.1) and on the remote machine with v1.4.2. In both cases I observe scaling similar to @goerch's. I tend to believe that the 1.5.3 installation on the remote machine must be broken.
Thanks for everyone's help; I was starting to think I was going crazy.
Try BLAS.set_num_threads(1). OpenBLAS's performance on LU factorizations (which det uses) degrades rapidly with additional cores; using multiple BLAS threads burns more CPU while also making each factorization take longer.
Anyway, if you want to improve performance, try
using LinearAlgebra
using BenchmarkTools
BLAS.set_num_threads(1)
function test1()
    N = 500
    m = 100
    A = randn(N, N, m)              # slice along the last dimension: views are contiguous in column-major order
    Threads.@threads for i ∈ 1:m
        det(@view(A[:, :, i]))      # @view avoids copying each N×N slice
    end
end
function test2()
    N = 500
    m = 100
    A = randn(N, N, m)
    for i ∈ 1:m
        det(@view(A[:, :, i]))
    end
end
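The @btime calls aren't shown above, but assuming Julia was started with several threads (JULIA_NUM_THREADS=4, or julia -t 4 on 1.5+), the comparison would be run as:

Threads.nthreads()  # confirm how many Julia threads are actually active
@btime test1()      # threaded over slices, BLAS pinned to one thread
@btime test2()      # serial loop for comparison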