I can now confirm with certainty that MATLAB's `lyap` is based on LAPACK's `dtrsyl3`. Judging from the figures here, a speed gain of up to 30 times can be expected with OpenBLAS for n = 1000 by replacing the call to `dtrsyl` with a call to `dtrsyl3`. By the way, is the `*trsyl3` family already included in libblastrampoline?
I believe they were introduced in pull request #125 ("Upgrade gensymbol") on JuliaLinearAlgebra/libblastrampoline (CC @amontoison).
I confirm that it should be in LBT >= 5.9.0.
I implemented an experimental wrapper `trsyl3!` around LAPACK's `*trsyl3` family, to be included in the next release of MatrixEquations. Here is an update on the timings with Julia's `trsyl!` (based on LAPACK's `dtrsyl`) and the new `trsyl3!`:
```julia
julia> using BenchmarkTools

julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;

julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
  2.488 s (3 allocations: 7.63 MiB)

julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
  159.684 ms (6 allocations: 7.64 MiB)
```
`trsyl3!` is almost 15 times faster than `trsyl!`, using OpenBLAS with a single thread. Somewhat better timings with MKL and 8 threads:
```julia
julia> using MKL

julia> using LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(8)

julia> using MatrixEquations

julia> using BenchmarkTools

julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;

julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
  2.520 s (3 allocations: 7.63 MiB)

julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
  136.203 ms (6 allocations: 7.64 MiB)

julia> Threads.nthreads()
8
```
Conclusion: `trsyl3!` is almost 20 times faster than `trsyl!`, which is as expected. Still, the timing for the new wrapper `trsyl3!` is almost 3 times larger than for the MATLAB implementation in the internal function `matlab.internal.math.sylvester_tri`:
```matlab
K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.
```
This discrepancy is of about the same order as the one for the problem formulated in the title of this post. The question remains: why? Unfortunately, I have no experience with VTune.
The `copy($c)` is something you're not doing in the MATLAB `tic`-`toc` run; it is responsible for those 3 allocations and 7.63 MiB, according to `@btime copy($c)`, and it accounts for a large part of the `trsyl3!` allocations too. However, I don't think the `copy`-ing accounts for any of these timing discrepancies. You could try the `setup` argument to isolate it from the benchmark, but the copying benchmark took only a few milliseconds on my laptop at low battery, so I really doubt it will make a dent even in the relevant context.
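For reference, here is a sketch of how the `setup` keyword of BenchmarkTools can take the `copy` out of the measured region (with a smaller `n` for illustration); `evals=1` makes each evaluation use a fresh copy, since `trsyl!` overwrites its last argument in place:

```julia
using BenchmarkTools, LinearAlgebra

n = 100
a, = schur(rand(n, n))          # quasi-upper-triangular Schur factor
c = rand(n, n); c = c' * c      # symmetric right-hand side

# The copy runs in `setup`, outside the timed region; `evals=1` ensures
# every evaluation sees an unmutated copy of c.
@btime LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, cc) setup=(cc = copy($c)) evals=1;
```

This only removes the copy from the reported time, so given the numbers above it should shave off at most a few milliseconds.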
Would it be possible to check that `strsyl3_` is in LBT (for both OpenBLAS and MKL)?
I made wrappers for `Float64`, `Float32`, `ComplexF64`, and `ComplexF32` data. The only one that doesn't work is the `Float32` wrapper, which fails with the following message:
```julia
julia> n = 5; a, = schur(rand(Float32,n,n)); c = rand(Float32,n,n); c = c'*c;

julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'T', a, a, copy(c));
Error: no BLAS/LAPACK library loaded for strsyl3_64_()
```
No problem for single-precision complex data:
```julia
julia> n = 5; a, = schur(rand(Complex{Float32},n,n)); c = rand(Complex{Float32},n,n); c = c'*c;

julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'C', a, a, copy(c));
  0.000036 seconds (13 allocations: 640 bytes)
```
If I load MKL after starting Julia, then it works correctly.
I checked the library `libopenblas64_.dll` indicated by the command:

```julia
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll
```

for the symbol `strsyl3_64_`, and this symbol is missing (the others, `dtrsyl_64_`, `ctrsyl_64_`, and `ztrsyl_64_`, exist!). Could this be the explanation for the above error?
I opened issue #150 about the above problem.
I agree with your points, but I used this form to avoid modifying the argument `c`, which would otherwise happen at each call.
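If a non-mutating entry point is ever wanted, the copy can be moved inside a small convenience wrapper instead (a sketch built on the stock `LAPACK.trsyl!`; the name `trsyl` is hypothetical):

```julia
using LinearAlgebra

# Hypothetical non-mutating front end: copy C once inside the wrapper,
# call the in-place LAPACK solver, and leave the caller's C untouched.
function trsyl(transa::Char, transb::Char, A::AbstractMatrix, B::AbstractMatrix,
               C::AbstractMatrix)
    return LinearAlgebra.LAPACK.trsyl!(transa, transb, A, B, copy(C))
end
```

The caller then pays for exactly one copy per call, the same cost as writing `copy(c)` at every call site, but without the risk of forgetting it.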