Does this package work only with simple operations?
Can I use SpinLock() inside the loop?
Can I use variables declared outside the loop? (Currently I get UndefVarError: varname#661 not defined.)
Actually, I'm trying to compute a sum of multiple inverted matrices.
I want to do something like this:
module A

using Polyester, LinearAlgebra

function f1()
    n = 1000
    m = 10
    mv = Vector{Matrix}(undef, n)
    sm = zeros(m, m)
    sdet = zero(Float64)
    l = Base.Threads.SpinLock()
    @inbounds @batch per=core for i = 1:n
        mv[i] = rand(m, m)
        lock(l) do
            sm += mv[i]
            sdet += det(mv[i])
        end
    end
    sdet, sm
end

function f2()
    n = 1000
    m = 10
    mv = Vector{Matrix}(undef, n)
    sm = zeros(m, m)
    sdet = zero(Float64)
    l = Base.Threads.SpinLock()
    @inbounds Base.Threads.@threads for i = 1:n
        mv[i] = rand(m, m)
        lock(l) do
            sm += mv[i]
            sdet += det(mv[i])
        end
    end
    sdet, sm
end

end # module A
Now I get UndefVarError: sm not defined. Then:
julia> @time A.f1();
ERROR: UndefVarError: sm#1761 not defined
Stacktrace:
[1] (::Main.A.var"#3#6"{Int64, Vector{Matrix{T} where T}})()
...
julia> @time A.f2();
0.404839 seconds (6.17 k allocations: 2.771 MiB)
You could file an issue.
But you're better off with something like:
using Polyester, LinearAlgebra

function f3()
    n = 1000
    m = 10
    sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
    sdets = Vector{Float64}(undef, Polyester.num_cores())
    d, r = divrem(n, Polyester.num_cores())
    @batch for i = 1:min(Polyester.num_cores(), n)
        offset = min(i-1, r) + (i-1)*d  # linear-index offset of this chunk (unused here, since rand() needs no index)
        sm = rand(m, m)
        sdet = det(sm)
        for j ∈ 2:d+(i ≤ r)
            mvⱼ = rand(m, m)
            sm += mvⱼ
            sdet += det(mvⱼ)
        end
        sms[i] = sm
        sdets[i] = sdet
    end
    sum(sdets), sum(sms)
end
The approach here is to use separate accumulators per thread instead of a lock, and then do a single serial accumulation of the threaded results at the end.
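For reference, here is a minimal sketch of the same lock-free pattern using plain Base.Threads.@threads instead of Polyester; the function name and the chunking details are illustrative, not part of the original code:

using LinearAlgebra, Base.Threads

function sum_and_det_threads(n = 1000, m = 10)
    nchunks = min(nthreads(), n)
    sms = Vector{Matrix{Float64}}(undef, nchunks)
    sdets = Vector{Float64}(undef, nchunks)
    d, r = divrem(n, nchunks)            # split n iterations into nchunks pieces
    @threads for t = 1:nchunks
        len = d + (t ≤ r)                # number of matrices handled by this chunk
        sm = zeros(m, m)
        sdet = 0.0
        for _ = 1:len
            A = rand(m, m)
            sm .+= A                     # accumulate into chunk-local state
            sdet += det(A)
        end
        sms[t] = sm                      # one write per chunk, no lock needed
        sdets[t] = sdet
    end
    sum(sdets), sum(sms)                 # single serial reduction at the end
end

Because each task only writes to its own slot of sms/sdets, there is no shared mutable state inside the loop and hence nothing to lock.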
Replacing mv = Vector{Matrix}(undef, n) with mv = Vector{Matrix{Float64}}(undef, n) in f2, I get:
julia> @benchmark f2()
BenchmarkTools.Trial: 738 samples with 1 evaluation.
 Range (min … max):  1.381 ms … 582.288 ms  ┊ GC (min … max):  0.00% … 99.34%
 Time  (median):     2.611 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.765 ms ±  25.475 ms  ┊ GC (mean ± σ):  18.88% ±  5.09%

 [histogram omitted]
 1.38 ms         Histogram: frequency by time          9.87 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 6147.
julia> @benchmark f3()
BenchmarkTools.Trial: 4309 samples with 1 evaluation.
 Range (min … max):  78.471 μs … 221.377 ms  ┊ GC (min … max):  0.00% … 83.76%
 Time  (median):    211.201 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):    1.165 ms ±  14.120 ms  ┊ GC (mean ± σ):  74.77% ±  6.21%

 [histogram omitted]
 78.5 μs        Histogram: log(frequency) by time         251 μs <

 Memory estimate: 2.70 MiB, allocs estimate: 4002.
As with most threaded code (and all threaded code that allocates memory), the histogram in these benchmarks is very useful for getting an idea of the time distribution. In contrast, the minimum (which is what @btime would report) is very misleading: 78.5 microseconds isn't at all representative of the 1.165 ms average, which is about 15× slower.
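If it helps, one way to pull both numbers out of a trial programmatically is sketched below; it assumes the times field of the BenchmarkTools.Trial object, which holds the raw sample times in nanoseconds:

using BenchmarkTools, Statistics

b = @benchmark f3()
minimum(b.times) / 1e3   # minimum in μs — what @btime would print
mean(b.times)    / 1e6   # mean in ms — closer to the real-world cost once GC kicks in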
PS:
Note that this @inbounds doesn't actually "work". Currently, @inbounds does not penetrate the closure created by @threads, hence @inbounds does not apply to anything inside the loop:
@inbounds Base.Threads.@threads for i = 1:n
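A minimal sketch of where the annotation would have to go instead, so that it applies inside the closure that @threads creates (the function name is illustrative, and of course only use @inbounds once you have checked the accesses really are in bounds):

function f2_inbounds(n = 1000, m = 10)
    mv = Vector{Matrix{Float64}}(undef, n)
    Base.Threads.@threads for i = 1:n
        # @inbounds is placed inside the loop body, so it survives the
        # closure that @threads wraps around the body.
        @inbounds mv[i] = rand(m, m)
    end
    mv
end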
If you want to optimize the code further, it's best to cut down on allocations:
using VectorizedRNG, Random  # Random for rand!

function f4()
    n = 1000
    m = 10
    sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
    sdets = Vector{Float64}(undef, Polyester.num_cores())
    d, r = divrem(n, Polyester.num_cores())
    @batch for i = 1:min(Polyester.num_cores(), n)
        offset = min(i-1, r) + (i-1)*d
        lrng = local_rng()            # thread-local RNG from VectorizedRNG
        sm = rand(lrng, m, m)
        mvⱼ = copy(sm)
        sdet = det(lu!(mvⱼ))          # lu! factorizes in place, reusing mvⱼ's memory
        for j ∈ 2:d+(i ≤ r)
            rand!(lrng, mvⱼ)          # refill in place instead of allocating a new matrix
            sm .+= mvⱼ
            sdet += det(lu!(mvⱼ))
        end
        sms[i] = sm
        sdets[i] = sdet
    end
    sum(sdets), sum(sms)
end
Now I get:
julia> @benchmark f4()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  71.648 μs … 131.333 ms  ┊ GC (min … max):  0.00% … 72.99%
 Time  (median):     80.276 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):  134.240 μs ±   2.404 ms  ┊ GC (mean ± σ):  37.13% ±  2.13%

 [histogram omitted]
 71.6 μs         Histogram: frequency by time          99.5 μs <

 Memory estimate: 176.89 KiB, allocs estimate: 1044.
And with a little more work…
using RecursiveFactorization

function f5()
    n = 1000
    m = 10
    sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
    sdets = Vector{Float64}(undef, Polyester.num_cores())
    d, r = divrem(n, Polyester.num_cores())
    @batch for i = 1:min(Polyester.num_cores(), n)
        offset = min(i-1, r) + (i-1)*d
        lrng = local_rng()
        sm = rand(lrng, m, m)
        mvⱼ = copy(sm)
        ipiv = Vector{LinearAlgebra.BlasInt}(undef, m)   # preallocated pivot vector, reused every iteration
        sdet = det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
        for j ∈ 2:d+(i ≤ r)
            rand!(lrng, mvⱼ)
            sm .+= mvⱼ
            sdet += det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
        end
        sms[i] = sm
        sdets[i] = sdet
    end
    sum(sdets), sum(sms)
end
Results:
julia> @benchmark f5()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.434 μs … 62.148 ms  ┊ GC (min … max):  0.00% … 43.27%
 Time  (median):     36.989 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   56.603 μs ±  1.071 ms  ┊ GC (mean ± σ):  14.10% ±  0.74%

 [histogram omitted]
 31.4 μs         Histogram: frequency by time          58.6 μs <

 Memory estimate: 38.23 KiB, allocs estimate: 58.
Thank you very much! Using separate accumulators per thread instead of a lock is a great idea. My actual function is a little more complicated, and I also store all the matrices in a vector (there is no rand(), that was just for the example), so I need linear indices too (no problem to calculate them).
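In case it is useful, here is a minimal sketch of how those linear indices could be recovered from the offset already computed in f3; the names mirror f3, and storing every matrix into a preallocated mv is my assumption about the real use case:

function f3_with_indices(n = 1000, m = 10)
    nchunks = Polyester.num_cores()
    mv = Vector{Matrix{Float64}}(undef, n)          # keep every matrix, as in the real code
    sms = Vector{Matrix{Float64}}(undef, nchunks)
    sdets = Vector{Float64}(undef, nchunks)
    d, r = divrem(n, nchunks)
    @batch for i = 1:min(nchunks, n)
        offset = min(i-1, r) + (i-1)*d              # linear index of the element just before this chunk
        sm = zeros(m, m)
        sdet = 0.0
        for j = 1:d+(i ≤ r)
            idx = offset + j                        # global linear index for this iteration
            mv[idx] = rand(m, m)
            sm .+= mv[idx]
            sdet += det(mv[idx])
        end
        sms[i] = sm
        sdets[i] = sdet
    end
    sum(sdets), sum(sms), mv
end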
One more observation: when I use the example above inside a module I get the error, but in the REPL there is no error.
I implemented this approach and went from 14 ms to 9 ms (about 1.5× faster). Thanks a lot!