Performance of a Julia binary wrapper behind a Matlab mex-file interface

I can now confirm with certainty that Matlab’s lyap is based on LAPACK’s dtrsyl3. Looking at the figures here, a speed gain of up to 30 times can be expected with OpenBLAS for n = 1000 by replacing the call to dtrsyl with a call to dtrsyl3. BTW, is the *trsyl3 family already included in libblastrampoline?

I believe they were introduced in Upgrade gensymbol by amontoison · Pull Request #125 · JuliaLinearAlgebra/libblastrampoline · GitHub (CC @amontoison)

I confirm that it should be in LBT >= 5.9.0.

I implemented an experimental wrapper trsyl3! for LAPACK’s *trsyl3 family, to be included in the next release of MatrixEquations. Here is an update on the timings with Julia’s trsyl! (based on LAPACK’s dtrsyl) and the new trsyl3!:

julia> using BenchmarkTools

julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;

julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
  2.488 s (3 allocations: 7.63 MiB)

julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
  159.684 ms (6 allocations: 7.64 MiB)

trsyl3! is almost 15 times faster than trsyl!, using OpenBLAS with a single thread.

Somewhat better timings with MKL and 8 threads:

julia> using MKL

julia> using LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(8)

julia> using MatrixEquations

julia> using BenchmarkTools

julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;

julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
  2.520 s (3 allocations: 7.63 MiB)

julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
  136.203 ms (6 allocations: 7.64 MiB)

julia> Threads.nthreads()
8

Conclusion: trsyl3! is almost 20 times faster than trsyl!, which is as expected. Still, the timing for the new wrapper trsyl3! is almost 3 times larger than for the MATLAB implementation in the internal function matlab.internal.math.sylvester_tri:

K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.

This discrepancy is of about the same order as the one for the problem formulated in the title of this post.

The question remains: Why?

Unfortunately, I have no experience with VTune.

The copy($c) is something you’re not doing in the MATLAB tic-toc run, and according to @btime copy($c) it is responsible for those 3 allocations and 7.63 MiB; it accounts for a large part of the trsyl3! allocations too. However, I don’t think the copying accounts for any of these timing discrepancies. You could use the setup argument to isolate it from the benchmark, but the copy alone only took a few milliseconds on my laptop at low battery, so I really doubt it would make a dent even in the relevant context.
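For reference, a minimal sketch of moving the copy out of the timed region with `@btime`’s `setup` keyword (using the same `a` and `c` as in the benchmarks above; the `evals=1` setting is needed so each evaluation gets a fresh, unmutated copy):

```julia
using BenchmarkTools, LinearAlgebra

n = 1000
a, = schur(rand(n, n))
c = rand(n, n); c = c' * c

# `setup` runs before each sample and is excluded from the timing;
# `evals=1` ensures the mutating call never reuses an already-overwritten cc.
@btime LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, cc) setup=(cc = copy($c)) evals=1
```

This times only the trsyl! call itself, so any remaining discrepancy with the MATLAB run cannot be attributed to the copy.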

Would it be possible to check that strsyl3_ is in LBT (both OpenBLAS and MKL)?

I made wrappers for :Float64, :Float32, :ComplexF64, and :ComplexF32 data. The only one that doesn’t work is :Float32, which produces the following message:

julia> n = 5; a, = schur(rand(Float32,n,n)); c = rand(Float32,n,n); c = c'*c;
julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'T', a, a, copy(c));
Error: no BLAS/LAPACK library loaded for strsyl3_64_()

No problem for single-precision complex data:

julia> n = 5; a, = schur(rand(Complex{Float32},n,n)); c = rand(Complex{Float32},n,n); c = c'*c;

julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'C', a, a, copy(c));
  0.000036 seconds (13 allocations: 640 bytes)

If I load MKL after starting Julia, then it works correctly.

I checked the library libopenblas64_.dll, indicated by the command:

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

for the symbol strsyl3_64_, and this symbol is missing (the others, dtrsyl_64_, ctrsyl_64_, and ztrsyl_64_, exist!). Could this be the explanation for the above error?

I opened an issue #150 on the above problem.

I agree with your points, but I used this form to prevent modifying the argument c, which would otherwise happen at each call.