Questions about Polyester.jl

Does this package work only with simple operations?
Can I use SpinLock() inside a loop?
Can I use variables declared outside the loop? (Currently I get UndefVarError: varname#661 not defined.)
Actually, I am trying to compute a sum of multiple inverted matrices.

:slight_smile:

I want to do something like this:

module A
using Polyester, LinearAlgebra

function f1()
    n = 1000
    m = 10
    mv = Vector{Matrix}(undef, n)
    sm = zeros(m, m)
    sdet = zero(Float64)
    l = Base.Threads.SpinLock()
    @inbounds @batch per=core for i = 1:n
        mv[i] = rand(m, m)
        lock(l) do
            sm += mv[i]
            sdet += det(mv[i])
        end
    end
    sdet, sm
end

function f2()
    n = 1000
    m = 10
    mv = Vector{Matrix}(undef, n)
    sm = zeros(m, m)
    sdet = zero(Float64)
    l = Base.Threads.SpinLock()
    @inbounds Base.Threads.@threads for i = 1:n
        mv[i] = rand(m, m)
        lock(l) do
            sm += mv[i]
            sdet += det(mv[i])
        end
    end
    sdet, sm
end
end

Now I get UndefVarError: sm not defined.

Then:

julia> @time A.f1();
ERROR: UndefVarError: sm#1761 not defined
Stacktrace:
 [1] (::Main.A.var"#3#6"{Int64, Vector{Matrix{T} where T}})()
...
julia> @time A.f2();
  0.404839 seconds (6.17 k allocations: 2.771 MiB)
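The stacktrace hints at the cause: @batch rewrites the loop body into a separate function, and reassigning a variable captured from the enclosing scope (as sm += ... and sdet += ... do) is not supported by that rewrite, while the ordinary closure created by @threads handles it. A minimal sketch of the same failing pattern, reduced from f1:

using Polyester

function g()
    s = 0.0
    @batch for i = 1:10
        s += i   # reassigning the captured `s` is what triggers the UndefVarError
    end
    s
end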

You could file an issue.
But you’re better off with something like:

using Polyester, LinearAlgebra
function f3()
  n = 1000
  m = 10
  # one accumulator pair per core, so no lock is needed
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())

  # split the n iterations into num_cores() chunks of size d,
  # with the first r chunks taking one extra iteration
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d  # 0-based start of chunk i (unused with rand, but handy when indexing)
    sm = rand(m, m)                 # first iteration of the chunk, done outside the inner loop
    sdet = det(sm)
    for j ∈ 2:d+(i ≤ r)
      mvⱼ = rand(m, m)
      sm += mvⱼ
      sdet += det(mvⱼ)
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  # single serial reduction of the per-core results
  sum(sdets), sum(sms)
end

The approach here is to use separate accumulators per thread instead of using a lock, and then finally do a single serial accumulation of the threaded results at the end.
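The index arithmetic in f3 splits the n iterations into one contiguous chunk per core, giving the first r chunks one extra iteration. A standalone sketch of that arithmetic (with hypothetical small numbers so the chunks are easy to see):

n, ncores = 10, 4
d, r = divrem(n, ncores)                  # d = 2, r = 2
for i = 1:ncores
    offset = min(i - 1, r) + (i - 1) * d  # 0-based start of chunk i
    len = d + (i ≤ r)                     # first r chunks get one extra iteration
    println("chunk $i covers ", offset+1:offset+len)
end
# chunk 1 covers 1:3, chunk 2 covers 4:6, chunk 3 covers 7:8, chunk 4 covers 9:10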

Replacing mv = Vector{Matrix}(undef, n) with the concretely typed mv = Vector{Matrix{Float64}}(undef, n) in f2 (the abstract element type costs performance), I get:

julia> @benchmark f2()
BenchmarkTools.Trial: 738 samples with 1 evaluation.
 Range (min … max):  1.381 ms … 582.288 ms  ┊ GC (min … max):  0.00% … 99.34%
 Time  (median):     2.611 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.765 ms ±  25.475 ms  ┊ GC (mean ± σ):  18.88% ±  5.09%

         █
  ▂▂▃▁▂▄▇█▂▁▂▂▁▁▁▁▂▂▁▁▁▂▁▂▁▁▂▁▁▁▁▁▂▁▂▂▁▁▂▂▂▃▂▂▃▃▃▃▄▄▄▄▃▄▄▄▄▃▂ ▂
  1.38 ms         Histogram: frequency by time        9.87 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 6147.

julia> @benchmark f3()
BenchmarkTools.Trial: 4309 samples with 1 evaluation.
 Range (min … max):   78.471 μs … 221.377 ms  ┊ GC (min … max):  0.00% … 83.76%
 Time  (median):     211.201 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.165 ms ±  14.120 ms  ┊ GC (mean ± σ):  74.77% ±  6.21%

  ▃▄                                         ▂▄▇██▇▅▂▁   ▁      ▂
  ██▇▆▆▆▇▄▁▁▁▃▁▁▆▅▁▃▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▄▅▅▁▁▁▁▁▆███████████▇█████▇ █
  78.5 μs       Histogram: log(frequency) by time        251 μs <

 Memory estimate: 2.70 MiB, allocs estimate: 4002.

As with most threaded code (and all threaded code that allocates memory), the histograms in these benchmarks are very useful for getting an idea of the time distribution. In contrast, the minimum (which is what @btime would report) is quite misleading here: 78.5 microseconds is not at all representative of the 1.165 ms mean, which is about 15× slower.

PS:
Note that this @inbounds doesn’t actually “work”. Currently, @inbounds does not penetrate the closure created by @threads, so it does not apply to anything inside the loop:

@inbounds Base.Threads.@threads for i = 1:n
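If you do want the bounds checks elided, a minimal sketch of the usual workaround (reusing f2’s mv, n, and m) is to move @inbounds inside the loop body, where no closure boundary separates it from the array access:

Base.Threads.@threads for i = 1:n
    @inbounds mv[i] = rand(m, m)   # @inbounds now applies to this access
end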

If you want to optimize the code further, it’s best to cut down on allocations:

using VectorizedRNG  # thread-local RNGs; Polyester and LinearAlgebra are loaded above
function f4()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())

  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    lrng = local_rng()          # this thread's RNG; no synchronization needed
    sm = rand(lrng, m, m)
    mvⱼ = copy(sm)              # reusable work buffer
    sdet = det(lu!(mvⱼ))        # factorize in place instead of allocating a copy
    for j ∈ 2:d+(i ≤ r)
      rand!(lrng, mvⱼ)          # refill the buffer instead of allocating a new matrix
      sm .+= mvⱼ
      sdet += det(lu!(mvⱼ))
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

Now I get:

julia> @benchmark f4()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   71.648 μs … 131.333 ms  ┊ GC (min … max):  0.00% … 72.99%
 Time  (median):      80.276 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   134.240 μs ±   2.404 ms  ┊ GC (mean ± σ):  37.13% ±  2.13%

              ▄▇█▇▅▃▃▃▅▅▅▄▂▁
  ▁▃▆▇▆▄▃▂▁▂▃▇███████████████▅▅▄▄▃▃▂▂▂▂▂▂▁▂▁▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  71.6 μs          Histogram: frequency by time         99.5 μs <

 Memory estimate: 176.89 KiB, allocs estimate: 1044.
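Much of the saving comes from reusing buffers: det(A) on a plain Matrix copies A and factorizes the copy, while det(lu!(A)) factorizes in place. A quick REPL sketch (exact byte counts will vary by Julia version):

using LinearAlgebra
A = rand(10, 10);
B = copy(A);
@allocated det(A)       # copies A internally before the LU factorization
@allocated det(lu!(B))  # factorizes B in place; only ipiv and the LU wrapper allocate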

And with a little more work…

using RecursiveFactorization  # pure-Julia LU, fast for small matrices
function f5()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())

  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    lrng = local_rng()
    sm = rand(lrng, m, m)
    mvⱼ = copy(sm)
    ipiv = Vector{LinearAlgebra.BlasInt}(undef, m)  # preallocate the pivot vector once per chunk
    sdet = det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))  # Val(true): partial pivoting
    for j ∈ 2:d+(i ≤ r)
      rand!(lrng, mvⱼ)
      sm .+= mvⱼ
      sdet += det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

Results:

julia> @benchmark f5()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.434 μs … 62.148 ms  ┊ GC (min … max):  0.00% … 43.27%
 Time  (median):     36.989 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   56.603 μs ±  1.071 ms  ┊ GC (mean ± σ):  14.10% ±  0.74%

   ▆▆▂▇   ▄▁█▆▃▁     ▁▂▁▂
  ▆█████▄▃██████▆▇▅▆██████▆▆▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  31.4 μs         Histogram: frequency by time        58.6 μs <

 Memory estimate: 38.23 KiB, allocs estimate: 58.

Thank you very much! Using separate accumulators per thread instead of a lock is a great idea. My actual function is a little more complicated, and I also save all the matrices in a vector (there is no rand(), it is just for the example), so I need the linear indices too (no problem to calculate them).
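For reference, the offset already computed in f3 gives those linear indices directly; a hypothetical sketch (the function name and the rand stand-in are mine) of storing every matrix at its global index:

using Polyester

function fill_mv(n = 1000, m = 10)
    mv = Vector{Matrix{Float64}}(undef, n)
    nc = Polyester.num_cores()
    d, r = divrem(n, nc)
    @batch for i = 1:min(nc, n)
        offset = min(i - 1, r) + (i - 1) * d
        for j = 1:d + (i ≤ r)
            mv[offset + j] = rand(m, m)   # offset + j is the global linear index
        end
    end
    mv
end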

One more observation: when I use the example above inside a module, I get the error; in the REPL, there is no error :laughing:

I implemented this approach and went from 14 ms to 9 ms (about 1.5× faster). Thanks a lot!