Questions about Polyester.jl

PharmCat · September 18, 2021, 12:19pm

It this package work only with simple operations?
Can I use SpinLock() inside cycle?
Can I use variables declared outside cycle? (now I get UndefVarError: varname#661 not defined)
Actually, I try to get a sum of multiple inversed matrices.

John_Gibson · September 18, 2021, 1:42pm

PharmCat · September 18, 2021, 5:32pm

I want to do something like this:

module A
using Polyester, LinearAlgebra
function f1()
n = 1000
m = 10
mv = Vector{Matrix}(undef, n)
sm = zeros(m, m)
sdet = zero(Float64)
l = Base.Threads.SpinLock()
@inbounds  @batch per=core for i = 1:n
   mv[i] =  rand(m,m)
   lock(l) do
       sm  += mv[i]
       sdet += det(mv[i])
   end
end
sdet, sm
end
function f2()
n = 1000
m = 10
mv = Vector{Matrix}(undef, n)
sm = zeros(m, m)
sdet = zero(Float64)
l = Base.Threads.SpinLock()
@inbounds Base.Threads.@threads  for i = 1:n
   mv[i] =  rand(m,m)
   lock(l) do
       sm  += mv[i]
       sdet += det(mv[i])
   end
end
sdet, sm
end
end

Now i have UndefVarError: sm not defined

Then:

julia> @time A.f1();
ERROR: UndefVarError: sm#1761 not defined
Stacktrace:
 [1] (::Main.A.var"#3#6"{Int64, Vector{Matrix{T} where T}})()
...
julia> @time A.f2();
  0.404839 seconds (6.17 k allocations: 2.771 MiB)

Elrod · September 18, 2021, 8:21pm

You could file an issue.
But, you’re better off with something like

using Polyester, LinearAlgebra
function f3()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())
  
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    sm = rand(m, m)
    sdet = det(sm)
    for j ∈ 2:d+(i ≤ r)
      mvⱼ = rand(m, m)
      sm += mvⱼ
      sdet += det(mvⱼ)
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

The approach here is to use separate accumualtors per thread instead of using a lock, and then finally do a single serial accumulation of the threaded results at the end.

Replacing mv = Vector{Matrix}(undef, n) with mv = Vector{Matrix{Float64}}(undef, n) in f2, I get:

julia> @benchmark f2()
BenchmarkTools.Trial: 738 samples with 1 evaluation.
 Range (min … max):  1.381 ms … 582.288 ms  ┊ GC (min … max):  0.00% … 99.34%
 Time  (median):     2.611 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.765 ms ±  25.475 ms  ┊ GC (mean ± σ):  18.88% ±  5.09%

         █
  ▂▂▃▁▂▄▇█▂▁▂▂▁▁▁▁▂▂▁▁▁▂▁▂▁▁▂▁▁▁▁▁▂▁▂▂▁▁▂▂▂▃▂▂▃▃▃▃▄▄▄▄▃▄▄▄▄▃▂ ▂
  1.38 ms         Histogram: frequency by time        9.87 ms <

 Memory estimate: 2.75 MiB, allocs estimate: 6147.

julia> @benchmark f3()
BenchmarkTools.Trial: 4309 samples with 1 evaluation.
 Range (min … max):   78.471 μs … 221.377 ms  ┊ GC (min … max):  0.00% … 83.76%
 Time  (median):     211.201 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.165 ms ±  14.120 ms  ┊ GC (mean ± σ):  74.77% ±  6.21%

  ▃▄                                         ▂▄▇██▇▅▂▁   ▁      ▂
  ██▇▆▆▆▇▄▁▁▁▃▁▁▆▅▁▃▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▄▅▅▁▁▁▁▁▆███████████▇█████▇ █
  78.5 μs       Histogram: log(frequency) by time        251 μs <

 Memory estimate: 2.70 MiB, allocs estimate: 4002.

As with most threaded code (and all threaded code that allocates memory), the histogram is very useful in these benchmarks to get an idea of the time distribution. In contrast, the minimum (which would be reported by @btime) is very misleading; 78.5 microseconds isn’t really representative at all of the 1.165ms average, which is about 15x slower.

PS:
Note that this @inbounds doesn’t actually “work”. Currently, @inbounds does not penetrate the closure created by @threads, hence @inbounds does not apply to anything inside the loop.

@inbounds Base.Threads.@threads  for i = 1:n

Elrod · September 18, 2021, 8:29pm

If you want to optimize the code further, it’s best to cut down on allocations:

using VectorizedRNG
function f4()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())
  
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    lrng = local_rng()
    sm = rand(lrng, m, m)
    mvⱼ = copy(sm)
    sdet = det(lu!(mvⱼ))
    for j ∈ 2:d+(i ≤ r)
      rand!(lrng, mvⱼ)
      sm .+= mvⱼ
      sdet += det(lu!(mvⱼ))
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

Now I get:

julia> @benchmark f4()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   71.648 μs … 131.333 ms  ┊ GC (min … max):  0.00% … 72.99%
 Time  (median):      80.276 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   134.240 μs ±   2.404 ms  ┊ GC (mean ± σ):  37.13% ±  2.13%

              ▄▇█▇▅▃▃▃▅▅▅▄▂▁
  ▁▃▆▇▆▄▃▂▁▂▃▇███████████████▅▅▄▄▃▃▂▂▂▂▂▂▁▂▁▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  71.6 μs          Histogram: frequency by time         99.5 μs <

 Memory estimate: 176.89 KiB, allocs estimate: 1044.

Elrod · September 18, 2021, 8:52pm

And with a little more work…

using RecursiveFactorization
function f5()
  n = 1000
  m = 10
  sms = Vector{Matrix{Float64}}(undef, Polyester.num_cores())
  sdets = Vector{Float64}(undef, Polyester.num_cores())
  
  d, r = divrem(n, Polyester.num_cores())
  @batch for i = 1:min(Polyester.num_cores(),n)
    offset = min(i-1, r) + (i-1)*d
    lrng = local_rng()
    sm = rand(lrng, m, m)
    mvⱼ = copy(sm)
    ipiv = Vector{LinearAlgebra.BlasInt}(undef, m)
    sdet = det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
    for j ∈ 2:d+(i ≤ r)
      rand!(lrng, mvⱼ)
      sm .+= mvⱼ
      sdet += det(RecursiveFactorization.lu!(mvⱼ, ipiv, Val(true)))
    end
    sms[i] = sm
    sdets[i] = sdet
  end
  sum(sdets), sum(sms)
end

Results:

julia> @benchmark f5()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.434 μs … 62.148 ms  ┊ GC (min … max):  0.00% … 43.27%
 Time  (median):     36.989 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   56.603 μs ±  1.071 ms  ┊ GC (mean ± σ):  14.10% ±  0.74%

   ▆▆▂▇   ▄▁█▆▃▁     ▁▂▁▂
  ▆█████▄▃██████▆▇▅▆██████▆▆▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  31.4 μs         Histogram: frequency by time        58.6 μs <

 Memory estimate: 38.23 KiB, allocs estimate: 58.

PharmCat · September 21, 2021, 9:32pm

Thank you very much! Using separate accumulators per thread instead of using a lock is a great idea. My current function is a little more complicated, and I save all matrices too in vector (there is no rand() it just for example), so I need linear indexes too (no problem to calculate them).

One more observation - when I use the example above inside module - I have error , if in REPL - no error

PharmCat · September 21, 2021, 11:56pm

I implemented this approach and get 9ms from 14 (1.5 times faster). Thanks a lot!

Topic		Replies	Views
Another slowdown when using `Threads.@threads`? General Usage performance	11	825	March 24, 2022
Using Polyester.jl with ChunkSplitters.jl Performance multithreading	4	476	January 11, 2024
Why with @threads, the execution time is worse? Performance question , multithreading	19	2797	September 16, 2021
Type-instability because of @threads boxing variables Performance parallel	23	1519	March 25, 2022
Nesting `Threads.@thread` and `Polyester.@batch` (or context manager to limit Polyester threads) Specific Domains multithreading , polyester	7	472	April 5, 2023

Questions about Polyester.jl

Related topics