Questions about Polyester.jl

Does this package work only with simple operations?
Can I use SpinLock() inside the loop?
Can I use variables declared outside the loop? (Right now I get UndefVarError: varname#661 not defined.)
Actually, I am trying to compute the sum of multiple inverted matrices.

:slight_smile:

I want to do something like this:

module A

using Polyester, LinearAlgebra

function f1()
    n = 1000
    m = 10
    mv = Vector{Matrix}(undef, n)
    sm = zeros(m, m)
    sdet = zero(Float64)
    l = Base.Threads.SpinLock()
    @inbounds @batch per=core for i = 1:n
        mv[i] = rand(m, m)
        lock(l) do
            sm += mv[i]
            sdet += det(mv[i])
        end
    end
    sdet, sm
end

function f2()
    n = 1000
    m = 10
    mv = Vector{Matrix}(undef, n)
    sm = zeros(m, m)
    sdet = zero(Float64)
    l = Base.Threads.SpinLock()
    @inbounds Base.Threads.@threads for i = 1:n
        mv[i] = rand(m, m)
        lock(l) do
            sm += mv[i]
            sdet += det(mv[i])
        end
    end
    sdet, sm
end

end

Now I get UndefVarError: sm not defined.

Then:

julia> @time A.f1();
ERROR: UndefVarError: sm#1761 not defined
Stacktrace:
 [1] (::Main.A.var"#3#6"{Int64, Vector{Matrix{T} where T}})()
...
julia> @time A.f2();
  0.404839 seconds (6.17 k allocations: 2.771 MiB)

You could file an issue.
But you’re better off with something like:

using Polyester, LinearAlgebra
function f3()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())
  
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    sm = rand(m, m)
    sdet = det(sm)
    for j ∈ 2:d+(i ≤ r)
      mvⱼ = rand(m, m)
      sm += mvⱼ
      sdet += det(mvⱼ)
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

The approach here is to use separate accumulators per thread instead of a lock, and then do a single serial accumulation of the per-thread results at the end.
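For comparison, the same accumulator-per-task pattern can be written with plain Base.Threads; this is only a minimal sketch (the function name f3_base and its default arguments are mine, not from the code above):

using LinearAlgebra

function f3_base(n = 1000, m = 10)
    nchunks = min(Threads.nthreads(), n)   # one accumulator per task
    sms   = Vector{Matrix{Float64}}(undef, nchunks)
    sdets = Vector{Float64}(undef, nchunks)
    d, r  = divrem(n, nchunks)
    Threads.@threads for t = 1:nchunks
        len  = d + (t ≤ r)                 # this task's share of the n matrices
        sm   = zeros(m, m)
        sdet = 0.0
        for _ = 1:len
            A = rand(m, m)
            sm  .+= A
            sdet += det(A)
        end
        sms[t]   = sm                      # no lock needed: each task owns slot t
        sdets[t] = sdet
    end
    sum(sdets), sum(sms)
end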

Replacing mv = Vector{Matrix}(undef, n) with the concretely typed mv = Vector{Matrix{Float64}}(undef, n) in f2, I get:

julia> @benchmark f2()
BenchmarkTools.Trial: 738 samples with 1 evaluation.
 Range (min … max):  1.381 ms … 582.288 ms  ┊ GC (min … max):  0.00% … 99.34%
 Time  (median):     2.611 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.765 ms ±  25.475 ms  ┊ GC (mean ± σ):  18.88% ±  5.09%

         █
  ▂▂▃▁▂▄▇█▂▁▂▂▁▁▁▁▂▂▁▁▁▂▁▂▁▁▂▁▁▁▁▁▂▁▂▂▁▁▂▂▂▃▂▂▃▃▃▃▄▄▄▄▃▄▄▄▄▃▂ ▂
  1.38 ms         Histogram: frequency by time        9.87 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 6147.

julia> @benchmark f3()
BenchmarkTools.Trial: 4309 samples with 1 evaluation.
 Range (min … max):   78.471 μs … 221.377 ms  ┊ GC (min … max):  0.00% … 83.76%
 Time  (median):     211.201 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.165 ms ±  14.120 ms  ┊ GC (mean ± σ):  74.77% ±  6.21%

  ▃▄                                         ▂▄▇██▇▅▂▁   ▁      ▂
  ██▇▆▆▆▇▄▁▁▁▃▁▁▆▅▁▃▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▄▅▅▁▁▁▁▁▆███████████▇█████▇ █
  78.5 μs       Histogram: log(frequency) by time        251 μs <

 Memory estimate: 2.70 MiB, allocs estimate: 4002.

As with most threaded code (and all threaded code that allocates memory), the histogram is very useful in these benchmarks to get an idea of the time distribution. In contrast, the minimum (which would be reported by @btime) is very misleading; 78.5 microseconds isn’t really representative at all of the 1.165ms average, which is about 15x slower.

PS:
Note that this @inbounds doesn’t actually β€œwork”. Currently, @inbounds does not penetrate the closure created by @threads, hence @inbounds does not apply to anything inside the loop.

@inbounds Base.Threads.@threads  for i = 1:n
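If you do want the bounds checks elided there, one option (a small sketch, not from the original post) is to put @inbounds inside the loop body, where it is no longer stripped away by the closure:

function fill_threaded!(mv, n, m)
    Base.Threads.@threads for i = 1:n
        @inbounds mv[i] = rand(m, m)   # @inbounds applied inside the closure body
    end
    mv
end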

If you want to optimize the code further, it’s best to cut down on allocations:

using VectorizedRNG
function f4()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())
  
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    lrng = local_rng()
    sm = rand(lrng, m, m)
    mvⱼ = copy(sm)
    sdet = det(lu!(mvⱼ))
    for j ∈ 2:d+(i ≤ r)
      rand!(lrng, mvⱼ)
      sm .+= mvⱼ
      sdet += det(lu!(mvⱼ))
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

Now I get:

julia> @benchmark f4()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   71.648 μs … 131.333 ms  ┊ GC (min … max):  0.00% … 72.99%
 Time  (median):      80.276 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   134.240 μs ±   2.404 ms  ┊ GC (mean ± σ):  37.13% ±  2.13%

              ▄▇█▇▅▃▃▃▅▅▅▄▂▁
  ▁▃▆▇▆▄▃▂▁▂▃▇███████████████▅▅▄▄▃▃▂▂▂▂▂▂▁▂▁▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  71.6 μs          Histogram: frequency by time         99.5 μs <

 Memory estimate: 176.89 KiB, allocs estimate: 1044.

And with a little more work…

using RecursiveFactorization
function f5()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())
  
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    lrng = local_rng()
    sm = rand(lrng, m, m)
    mvⱼ = copy(sm)
    ipiv = Vector{LinearAlgebra.BlasInt}(undef, m)
    sdet = det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
    for j ∈ 2:d+(i ≤ r)
      rand!(lrng, mvⱼ)
      sm .+= mvⱼ
      sdet += det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

Results:

julia> @benchmark f5()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.434 μs … 62.148 ms  ┊ GC (min … max):  0.00% … 43.27%
 Time  (median):     36.989 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   56.603 μs ±  1.071 ms  ┊ GC (mean ± σ):  14.10% ±  0.74%

   ▆▆▂▇   ▄▁█▆▃▁     ▁▂▁▂
  ▆█████▄▃██████▆▇▅▆██████▆▆▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  31.4 μs         Histogram: frequency by time        58.6 μs <

 Memory estimate: 38.23 KiB, allocs estimate: 58.

Thank you very much! Using separate accumulators per thread instead of a lock is a great idea. My current function is a little more complicated, and I also save all the matrices in a vector (the rand() is just for the example), so I need linear indices too (no problem to calculate them).
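For what it's worth, here is a rough sketch (mine, not from the answer above) of how the otherwise unused offset in f3 could provide that linear index when every matrix is also stored; rand(m, m) stands in for the real computation, and n, m, mv, sms, and sdets are assumed to be preallocated as in f3, with length(mv) == n:

d, r = divrem(n, Polyester.num_cores())
@batch for i = 1:min(Polyester.num_cores(), n)
    offset = min(i - 1, r) + (i - 1) * d   # items handled by earlier chunks
    sm = zeros(m, m)
    sdet = 0.0
    for j ∈ 1:d + (i ≤ r)
        Mj = rand(m, m)                    # stand-in for the real per-item work
        mv[offset + j] = Mj                # offset + j is a unique linear index
        sm .+= Mj
        sdet += det(Mj)
    end
    sms[i] = sm
    sdets[i] = sdet
end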

One more observation: when I use the example above inside a module I get the error, but in the REPL there is no error :laughing:

I implemented this approach and got 9 ms instead of 14 ms (about 1.5 times faster). Thanks a lot!