I can now with certainty confirm that Matlab’s lyap is based on LAPACK’s dtrsyl3. Looking to the figures here, a speed gain up to 30 times can be expected with OpenBLAS for n = 1000 by replacing the call to dtrsyl by the call to dtrsyl3. BTW, is the *trsyl3 family already included in the libblastrampoline?
I believe they were introduced in Upgrade gensymbol by amontoison · Pull Request #125 · JuliaLinearAlgebra/libblastrampoline · GitHub (CC @amontoison)
I confirm that it should be in LBT >= 5.9.0.
I implemented an experimental wrapper trsyl3! to the LAPACK’s *trsyl3 family to be included in the next release of MatrixEquations. Just an update on the timings with Julia’s trsyl! (based on LAPACK’s dtrsyl) and the new trsyl3!:
julia> using BenchmarkTools
julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;
julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
2.488 s (3 allocations: 7.63 MiB)
julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
159.684 ms (6 allocations: 7.64 MiB)
trsyl3! is almost 15 times faster than trsyl!, using OpenBLAS and 1 thread.
Somewhat better timings with MKL and 8 threads:
julia> using MKL
julia> using LinearAlgebra
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> using MatrixEquations
julia> using BenchmarkTools
julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;
julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
2.520 s (3 allocations: 7.63 MiB)
julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
136.203 ms (6 allocations: 7.64 MiB)
julia> Threads.nthreads()
8
Fazit: trsyl3! is almost 20 times faster than trsyl!, which is as expected. Still the timing for the new wrapper trsyl3! is almost 3 times larger than for the MATLAB implementation in the internal function matlab.internal.math.sylvester_tri
K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.
This discrepancy is about of the same order as that one for the formulated problem in the title of this post.
The question remains: Why?
![]()
Unfortunately I have no any experience with VTune.
The copy($c) is something you’re not doing in the MATLAB tic-toc run and is responsible for those 3 allocations and 7.63 MiB, according to @btime copy($c), and it accounts for a large part of the trysyl3! allocations too. However, I don’t think copy-ing would account for any of these timing discrepancies. You could try the setup argument to isolate it from the benchmark, but the copying benchmark only took a few milliseconds on my laptop at low battery, so I really doubt it’ll make a dent even in the relevant context.
Would it be possible to check that strsyl3_ is in LBT (both OpenBLAS and MKL)?
I made wrappers for :Float64, :Float32, :ComplexF64, :ComplexF32 data. The only which doesn’t work is for :Float32, receiving the following message:
julia> n = 5; a, = schur(rand(Float32,n,n)); c = rand(Float32,n,n); c = c'*c;
julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'T', a, a, copy(c));
Error: no BLAS/LAPACK library loaded for strsyl3_64_()
No problem for single-precision complex data:
julia> n = 5; a, = schur(rand(Complex{Float32},n,n)); c = rand(Complex{Float32},n,n); c = c'*c;
julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'C', a, a, copy(c));
0.000036 seconds (13 allocations: 640 bytes)
If I load after starting Julia MKL, then it works correctly.
I checked the library libopenblas64_.dll indicated by the command:
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll
for the symbol strsyl3_64_ and this symbol is missing (the others dtrsyl_64_, ctrsyl_64_ and ztrsyl_64_ exist!). Could this be the explanation for the above error?
I opened an issue #150 on the above problem.
I agree with your points, but I used this form to prevent changing the argument c, which would happen at each call.
I would like to close this discussion with positive news. I implemented a new version of the Lyapunov equation solver based on recursive blocking techniques. With this new technique I got the following results for solving a Lyapunov equation of order 1000 with the new lyapc:
using MKL
using LinearAlgebra
using MatrixEquations
using BenchmarkTools
n = 1000; a = rand(n,n); c = collect(hermitianpart!(rand(n,n)));
@btime lyapc($a,$c,blocksize=32);
390.184 ms (2575 allocations: 49.91 MiB)
This is a single threaded execution. For the Julia’s lyap function the timing is
@btime x = lyap($a,$c);
2.891 s (33 allocations: 46.07 MiB)
which is a huge difference.
For MATLAB 2024b running with 1 thread I got with the lyap function:
>> n = 1000; a = rand(n); c = rand(n); c = c+c';
>> maxNumCompThreads(1)
ans =
1
>> tic; lyap(a,c); toc
Elapsed time is 0.739932 seconds.
>> maxNumCompThreads(8)
ans =
8
>> tic; lyap(a,c); toc
Elapsed time is 0.449892 seconds.
So, the single threaded new Julia solver lyapc is faster than the MATLAB’s lyap running with 8 threads!
Can we switch julia over to using trsyl3! by default?
This is certainly possible and even desirable. It would be also nice to check if the input matrix C is symmetric/Hermitian and provide accordingly the symmetric/Hermitian solution. It is also desirable to switch to using trsyl3! in the sylvester function.
The wrappers for trsyl3! are available in MatrixEquations.jl in the directory src/lapackutils.jl.
FYI, these are the results on my MacBook Air (M2):
julia> using LinearAlgebra
julia> using BenchmarkTools
julia> using MatrixEquations
julia> n = 1000; a = rand(n,n); c = collect(hermitianpart!(rand(n,n)));
julia> @btime lyapc($a,$c,blocksize=32);
418.152 ms (2599 allocations: 50.06 MiB)
julia> @btime x = lyap($a,$c);
1.465 s (33 allocations: 46.14 MiB)
The gap is a bit narrower.
Furthermore, with AppleAccelerate.jl:
julia> using AppleAccelerate
julia> @btime lyapc($a,$c,blocksize=32);
362.640 ms (2599 allocations: 50.06 MiB)
julia> @btime x = lyap($a,$c);
983.153 ms (33 allocations: 46.14 MiB)
In Matlab:
>> n = 1000; a = rand(n); c = rand(n); c = c+c';
>> maxNumCompThreads(1)
ans =
8
>> tic; lyap(a,c); toc
Elapsed time is 0.577691 seconds.
>> maxNumCompThreads(8)
ans =
1
>> tic; lyap(a,c); toc
Elapsed time is 0.402229 seconds.
For N=1000, Julia’s multi-threading did not increase efficiency. So I tried N = 4000. Here are the results (for Windows 11 with Julia 1.12.5):
julia> n = 4000; a = rand(n,n); c = collect(hermitianpart!(rand(n,n)));
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @time lyapc(a,c; blocksize=64);
31.234233 seconds (9.12 k allocations: 794.227 MiB, 0.34% gc time)
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> @time lyapc(a,c; blocksize=64);
18.761670 seconds (9.12 k allocations: 794.226 MiB, 0.94% gc time)
The same for MATLAB:
>> n = 4000; a = rand(n); c = rand(n); c = c+c';
>> maxNumCompThreads(1)
ans =
8
>> tic; lyap(a,c); toc
Elapsed time is 32.669608 seconds.
>> maxNumCompThreads(8)
ans =
1
>> tic; lyap(a,c); toc
Elapsed time is 19.302684 seconds.
Thus:
- Julia (1 thread \rightarrow 8 threads): 31.23\text{s} \rightarrow 18.76\text{s} (A 1.66x speedup).
- MATLAB (1 thread \rightarrow 8 threads): 32.66\text{s} \rightarrow 19.30\text{s} (A 1.69x speedup).
There is no 8x speedup, but this is because there is no intrinsic parallelism in the basic algorithm. The 1.68x gain arises (probably) from calls to sysr2k! .
The next challenge is the solution of discrete Lyapunov equations at MATLAB’s speed. The situation now (for N = 4000) is:
with lyapd from MatrixEquations.jl:
julia> @time lyapd(a,c);
118.197316 seconds (49 allocations: 733.675 MiB, 0.13% gc time)
with MATLAB’s dlyap:
>> tic; dlyap(a,c); toc
Elapsed time is 21.115511 seconds.
These are the results for the new discrete Lyapunov solver based on recursive blocking algorithm:
MATLAB:
>> n = 4000; a = rand(n); c = rand(n); c = c+c';
>> maxNumCompThreads(1)
ans =
8
>> tic; dlyap(a,c); toc
Elapsed time is 35.799508 seconds.
>> maxNumCompThreads(8)
ans =
1
>> tic; dlyap(a,c); toc
Elapsed time is 21.307347 seconds.
Julia:
using MKL
using LinearAlgebra
using MatrixEquations
julia> n = 4000; a = rand(n,n); c = collect(hermitianpart!(rand(n,n)));
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @time lyapd(a,c;blocksize=32);
33.439628 seconds (162.77 k allocations: 902.105 MiB, 0.37% gc time)
julia> @time lyapd(a,c;blocksize=64);
34.137566 seconds (38.86 k allocations: 892.192 MiB, 0.54% gc time)
julia> @time lyapd(a,c;blocksize=128);
34.852278 seconds (9.85 k allocations: 888.748 MiB, 0.51% gc time)
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> @time lyapd(a,c;blocksize=32);
20.961486 seconds (155.50 k allocations: 901.627 MiB, 0.88% gc time)
julia> @time lyapd(a,c;blocksize=64);
21.096944 seconds (38.86 k allocations: 892.190 MiB, 0.51% gc time)
julia> @time lyapd(a,c;blocksize=128);
21.312522 seconds (9.83 k allocations: 888.747 MiB, 0.85% gc time)
I would say we are on par with MATLAB!
Excellent! FYI, this is what I get on my M2 MacBook Air:
Matlab:
>> n = 4000; a = rand(n); c = rand(n); c = c+c';
>> maxNumCompThreads(1)
ans =
8
>> tic; dlyap(a,c); toc
Elapsed time is 18.943170 seconds.
>> maxNumCompThreads(8)
ans =
1
>> tic; dlyap(a,c); toc
Elapsed time is 19.007589 seconds.
Julia:
julia> using AppleAccelerate
julia> using LinearAlgebra
julia> using BenchmarkTools
julia> using MatrixEquations
julia> n = 4000; a = rand(n,n); c = collect(hermitianpart!(rand(n,n)));
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @btime lyapd($a,$c;blocksize=32);
18.858 s (169977 allocations: 902.66 MiB)
julia> @btime lyapd($a,$c;blocksize=64);
19.534 s (38862 allocations: 892.28 MiB)
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> @btime lyapd($a,$c;blocksize=32);
18.869 s (169977 allocations: 902.66 MiB)
julia> @btime lyapd($a,$c;blocksize=64);
19.456 s (38862 allocations: 892.28 MiB)
By the way, upon noticing that apparently on my M2 MacBook Air there was no difference if I set set_num_threads(1) or set_num_threads(8), I read somewhere that AppleAccelerate ignores these. I then re-ran the above code without using AppleAccelerate. There is now some response to set_num_threads(8), but the results are inferior to using AppleAccelerate:
julia> using LinearAlgebra
julia> using BenchmarkTools
julia> using MatrixEquations
julia> n = 4000; a = rand(n,n); c = collect(hermitianpart!(rand(n,n)));
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @btime lyapd($a,$c;blocksize=32);
32.491 s (169977 allocations: 902.66 MiB)
julia> @btime lyapd($a,$c;blocksize=64);
32.964 s (38862 allocations: 892.28 MiB)
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> @btime lyapd($a,$c;blocksize=32);
25.328 s (169977 allocations: 902.65 MiB)
julia> @btime lyapd($a,$c;blocksize=64);
27.323 s (38862 allocations: 892.28 MiB)