Problem
I’m trying to parallelize my time propagator and not seeing very encouraging results.
A simple part is where I exponentiate a lot of independent operators using Crank–Nicolson:

$$b_i = \exp(-\mathrm{i}\delta t\,h_i)\,a_i \approx \left[I + \frac{\mathrm{i}\delta t}{2}h_i\right]^{-1}\left[I - \frac{\mathrm{i}\delta t}{2}h_i\right]a_i,$$

where a_i and b_i are the columns of the matrices A and B, respectively, and the h_i are complex-symmetric tridiagonal matrices. Since the h_i are time-independent, I can prefactorize the solver step once (LDLᵀ factorization) and reuse it at every time step.
Code
At the moment, I’m trying to parallelize this with `Threads.@threads` over the individual systems:
using LinearAlgebra

struct CrankNicolson{FT,BT}
    F::FT
    B::BT
end

"""
    CrankNicolson(A, δt)

Create a Crank–Nicolson exponentiator `[I - δt/2*A]⁻¹[I + δt/2*A]`.
"""
function CrankNicolson(A::AbstractMatrix, δt::Number)
    μ = δt/2
    F = I + μ*A
    B = factorize(I - μ*A)  # LDLᵀ for a complex-symmetric tridiagonal A
    CrankNicolson(F, B)
end

function expv!(w, cn::CrankNicolson, v)
    # Propagation is "in-place", i.e. the result ends up in v;
    # w is only used as a temporary.
    mul!(w, cn.F, v)
    ldiv!(v, cn.B, w)
end

struct ACollectionOfCranks{C<:CrankNicolson}
    c::Vector{C}
end

ACollectionOfCranks(hs::AbstractVector{<:AbstractMatrix}, μ) =
    ACollectionOfCranks([CrankNicolson(h, μ) for h in hs])

function expv!(B, cs::ACollectionOfCranks, A)
    # One system per column; each iteration is independent of the others.
    Threads.@threads for j = 1:size(A,2)
        expv!(view(B, :, j), cs.c[j], view(A, :, j))
    end
end
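To illustrate the calling convention for a single system (a minimal sketch; note that the propagated vector ends up in v, with w only used as scratch):

using LinearAlgebra

h = SymTridiagonal(rand(ComplexF64, 10), rand(ComplexF64, 9))
cn = CrankNicolson(h, -im*0.01)
v = rand(ComplexF64, 10)
w = similar(v)
expv!(w, cn, v)  # v ≈ exp(-0.01im*h)*v after the call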
nt = 36258     # number of time steps
δt = 5.52e-02  # time step
nr = 3200      # size of each system
nl = 41        # number of independent systems

A = rand(ComplexF64, nr, nl)
B = similar(A)

hs = map(1:nl) do i
    d = rand(ComplexF64, nr)
    e = rand(ComplexF64, nr-1)
    SymTridiagonal(d, e)
end

cs = ACollectionOfCranks(hs, -im*δt)
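The benchmarks below then boil down to something like this (a sketch, using BenchmarkTools.jl; one Julia session per thread count, e.g. julia -t 16, since the number of threads is fixed at startup):

using BenchmarkTools

@benchmark expv!($B, $cs, $A)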
I’ve tested this on two different machines:
julia> versioninfo()
Julia Version 1.7.0-DEV.563
Commit 0858864409* (2021-02-17 20:33 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 9 3950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
julia> versioninfo()
Julia Version 1.7.0-DEV.499
Commit 21557721f2* (2021-02-09 08:54 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC 7402 24-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
where the second machine has two sockets (2 × 24 cores), i.e. 96 hardware threads in total.
Results
Here are the results; the top panel shows median (solid lines) and mean (dashed lines) times, and the coloured bands show the min-to-max range, all from BenchmarkTools.jl. The bottom panel shows the speedups; the dotted line is linear scaling.

Parallelizing over the individual systems will of course not scale beyond 41 threads: with only 41 systems, any additional threads simply have no work to do. Below that there is also a load-imbalance staircase, e.g. with 32 threads some threads get two systems (⌈41/32⌉ = 2), so the ideal speedup there is 41/2 = 20.5 rather than 32. However, the speedup dies off much more quickly than I would expect, even accounting for that.
Ideas
The forward step, i.e.

$$w_i = \left[I - \frac{\mathrm{i}\delta t}{2}h_i\right]v_i,$$

is basically a contraction, so I could probably rewrite it using Tullio.jl. The backward step is slightly more complicated to wrap my head around, since solving a tridiagonal system is essentially sequential, unless you use something like cyclic reduction.
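For reference, here is a rough sketch of what the Tullio.jl version of the forward step could look like, building on the setup above. The stacked-diagonal layout (D, E, W are my names) and the use of Tullio's pad keyword for the boundary rows are assumptions on my part, not tested code:

using Tullio

μ = -im*δt/2
D = reduce(hcat, [h.dv for h in hs])  # main diagonals, nr × nl
E = reduce(hcat, [h.ev for h in hs])  # off-diagonals, (nr-1) × nl
W = similar(A)

# w_i = [I - im*δt/2*h_i] v_i for all systems at once; pad=0 should treat the
# out-of-bounds reads at k = 1 and k = nr as zeros, which is exactly what the
# tridiagonal structure requires.
@tullio W[k,j] = A[k,j] + μ*(D[k,j]*A[k,j] + E[k-1,j]*A[k-1,j] + E[k,j]*A[k+1,j]) pad=0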