using LinearAlgebra
using SharedArrays
using BenchmarkTools
N = 50
m = 10
A = randn(m, N, N) |> SharedArray
@btime Threads.@threads for i ∈ 1:m
    det(A[i, :, :])
end
@btime for i ∈ 1:m
    det(A[i, :, :])
end
which gives
1.079 ms (122 allocations: 401.75 KiB)
5.232 ms (101 allocations: 398.94 KiB)
But for the original sizes I get:
9.166 s (933 allocations: 381.89 MiB)
901.518 ms (1001 allocations: 381.89 MiB)
So maybe the problem is due to oversubscription of the cores?
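One quick thing to check is whether the Julia threads and OpenBLAS's own threads together oversubscribe the machine. A minimal sanity check (only standard calls, nothing specific to this example):

using LinearAlgebra

Threads.nthreads()      # Julia threads (set via JULIA_NUM_THREADS)
Sys.CPU_THREADS         # logical CPU threads visible to Julia
# If Julia threads × BLAS threads exceeds the physical cores, the threaded
# det calls end up fighting for cores; pinning BLAS to one thread avoids that:
BLAS.set_num_threads(1)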
using LinearAlgebra
using SharedArrays
using BenchmarkTools
function test1(tc)
    N = 500
    m = 100
    A = randn(m, N, N) |> SharedArray
    for i in 1:tc:m-tc              # note: this range stops at m-tc, so the last tc rows are never processed
        Threads.@threads for j ∈ i:i+tc-1
            det(A[j, :, :])         # A[j, :, :] copies the slice (first-dimension slices are non-contiguous)
        end
    end
end
function test2()
    N = 500
    m = 100
    A = randn(m, N, N) |> SharedArray
    for i ∈ 1:m
        det(A[i, :, :])
    end
end
@btime test1(1)
@btime test1(2)
@btime test1(5)
@btime test1(10)
@btime test2()
shows
1.265 s (3643 allocations: 569.10 MiB)
679.938 ms (2084 allocations: 565.13 MiB)
597.149 ms (1149 allocations: 553.58 MiB)
7.581 s (809 allocations: 534.46 MiB)
1.256 s (575 allocations: 572.62 MiB)
using LinearAlgebra
using BenchmarkTools
function test1(tc)
    N = 500
    m = 100
    A = randn(m, N, N)
    for i in 1:tc:m-tc
        Threads.@threads for j ∈ i:i+tc-1
            det(A[j, :, :])
        end
    end
end
function test2()
    N = 500
    m = 100
    A = randn(m, N, N)
    for i ∈ 1:m
        det(A[i, :, :])
    end
end
952.835 ms (3566 allocations: 569.10 MiB)
566.878 ms (2014 allocations: 565.13 MiB)
467.109 ms (1077 allocations: 553.58 MiB)
5.432 s (734 allocations: 534.46 MiB)
954.884 ms (502 allocations: 572.62 MiB)
Thanks for the tests! I fear something must be severely wrong either with the server or with my Julia environment:
julia> @btime test1(1)
  3.523 s (20467 allocations: 571.43 MiB)
julia> @btime test1(2)
  12.826 s (10615 allocations: 566.19 MiB)
julia> @btime test1(5)
  19.033 s (4687 allocations: 553.89 MiB)
julia> @btime test1(10)
  11.294 s (2490 allocations: 534.52 MiB)
julia> @btime test1(40)
  9.206 s (889 allocations: 496.15 MiB)
julia> @btime test2()
  694.097 ms (502 allocations: 572.43 MiB)
Global scope cannot be the problem: I used @goerch's examples for this, and in my production code, where I observed these problems, everything is also nicely packed into functions.
@Oscar: negative. I could observe the problem with multiple tests I did today. det was just something I used for the MWE since it is non-allocating and numerically “expensive”, i.e., it provides a nice test case.
(Allocating examples had the obvious problem that 40 cores fighting for memory are inherently slow.)
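On the allocation point, a minimal sketch (my own, not one of the tests above) of keeping the per-iteration copies out of a threaded loop by giving each thread a preallocated scratch matrix and factorizing in place with lu!; test1_buffered and the buffer names are hypothetical:

using LinearAlgebra

function test1_buffered()
    N = 500
    m = 100
    A = randn(N, N, m)
    # one reusable N×N scratch matrix per thread
    bufs = [Matrix{Float64}(undef, N, N) for _ in 1:Threads.nthreads()]
    dets = Vector{Float64}(undef, m)
    Threads.@threads for i in 1:m
        buf = bufs[Threads.threadid()]
        copyto!(buf, @view A[:, :, i])  # copy the slice into the thread's scratch matrix
        dets[i] = det(lu!(buf))         # in-place LU; det of an LU factorization is cheap
    end
    dets
end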
I have now rerun the tests both on my local machine (v1.4.1) and on the remote machine with v1.4.2. In both cases I observe scaling similar to @goerch's. I tend to believe that the 1.5.3 installation on the remote machine must be broken.
Thanks for everyone's help; I was starting to think I was going crazy.
Try BLAS.set_num_threads(1). OpenBLAS's performance on LU factorizations (which det uses) degrades rapidly with additional cores; using multiple BLAS threads burns more CPU while also making each factorization take longer.
Anyway, if you want to improve performance, try
using LinearAlgebra
using BenchmarkTools
BLAS.set_num_threads(1)
function test1()
    N = 500
    m = 100
    A = randn(N, N, m)              # slice along the last dimension: views are contiguous in column-major order
    Threads.@threads for i ∈ 1:m
        det(@view(A[:, :, i]))      # @view avoids copying each N×N slice
    end
end
function test2()
    N = 500
    m = 100
    A = randn(N, N, m)
    for i ∈ 1:m
        det(@view(A[:, :, i]))
    end
end
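The @btime calls aren't shown above, but assuming Julia was started with several threads (JULIA_NUM_THREADS=4, or julia -t 4 on 1.5+), the comparison would be run as:

Threads.nthreads()  # confirm how many Julia threads are actually active
@btime test1()      # threaded over slices, BLAS pinned to one thread
@btime test2()      # serial loop for comparison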