Performance of Julia binary wrapper behind Matlab mex-file interface performance

andreasvarga · March 16, 2025, 4:02pm

I can now with certainty confirm that Matlab’s lyap is based on LAPACK’s dtrsyl3. Looking to the figures here, a speed gain up to 30 times can be expected with OpenBLAS for n = 1000 by replacing the call to dtrsyl by the call to dtrsyl3. BTW, is the *trsyl3 family already included in the libblastrampoline?

giordano · March 16, 2025, 4:10pm

I believe they were introduced in Upgrade gensymbol by amontoison · Pull Request #125 · JuliaLinearAlgebra/libblastrampoline · GitHub (CC @amontoison)

amontoison · March 16, 2025, 4:25pm

I confirm that it should be in LBT >= 5.9.0.

andreasvarga · March 19, 2025, 11:12am

I implemented an experimental wrapper trsyl3! to the LAPACK’s *trsyl3 family to be included in the next release of MatrixEquations. Just an update on the timings with Julia’s trsyl! (based on LAPACK’s dtrsyl) and the new trsyl3!:

julia> using BenchmarkTools

julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;

julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
  2.488 s (3 allocations: 7.63 MiB)

julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
  159.684 ms (6 allocations: 7.64 MiB)

trsyl3! is almost 15 times faster than trsyl!, using OpenBLAS and 1 thread.

Somewhat better timings with MKL and 8 threads:

julia> using MKL

julia> using LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(8)

julia> using MatrixEquations

julia> using BenchmarkTools

julia> n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;

julia> @btime Y, scale = LinearAlgebra.LAPACK.trsyl!('N', 'T', $a, $a, copy($c));
  2.520 s (3 allocations: 7.63 MiB)

julia> @btime Y, scale = MatrixEquations.trsyl3!('N', 'T', $a, $a, copy($c));
  136.203 ms (6 allocations: 7.64 MiB)

julia> Threads.nthreads()
8

Fazit: trsyl3! is almost 20 times faster than trsyl!, which is as expected. Still the timing for the new wrapper trsyl3! is almost 3 times larger than for the MATLAB implementation in the internal function matlab.internal.math.sylvester_tri

K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.

This discrepancy is about of the same order as that one for the formulated problem in the title of this post.

The question remains: Why?

giordano · March 19, 2025, 11:41am

andreasvarga · March 19, 2025, 11:51am

Unfortunately I have no any experience with VTune.

giordano · March 19, 2025, 12:48pm

Benny · March 20, 2025, 2:25am

The copy($c) is something you’re not doing in the MATLAB tic-toc run and is responsible for those 3 allocations and 7.63 MiB, according to @btime copy($c), and it accounts for a large part of the trysyl3! allocations too. However, I don’t think copy-ing would account for any of these timing discrepancies. You could try the setup argument to isolate it from the benchmark, but the copying benchmark only took a few milliseconds on my laptop at low battery, so I really doubt it’ll make a dent even in the relevant context.

andreasvarga · March 20, 2025, 3:06pm

Would it be possible to check that strsyl3_ is in LBT (both OpenBLAS and MKL)?

I made wrappers for :Float64, :Float32, :ComplexF64, :ComplexF32 data. The only which doesn’t work is for :Float32, receiving the following message:

julia> n = 5; a, = schur(rand(Float32,n,n)); c = rand(Float32,n,n); c = c'*c;
julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'T', a, a, copy(c));
Error: no BLAS/LAPACK library loaded for strsyl3_64_()

No problem for single-precision complex data:

julia> n = 5; a, = schur(rand(Complex{Float32},n,n)); c = rand(Complex{Float32},n,n); c = c'*c;

julia> @time Y, scale = MatrixEquations.trsyl3!('N', 'C', a, a, copy(c));
  0.000036 seconds (13 allocations: 640 bytes)

If I load after starting Julia MKL, then it works correctly.

I checked the library libopenblas64_.dll indicated by the command:

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

for the symbol strsyl3_64_ and this symbol is missing (the others dtrsyl_64_, ctrsyl_64_ and ztrsyl_64_ exist!). Could this be the explanation for the above error?

andreasvarga · March 21, 2025, 10:20am

I opened an issue #150 on the above problem.

andreasvarga · March 21, 2025, 10:23am

I agree with your points, but I used this form to prevent changing the argument c, which would happen at each call.

Topic		Replies	Views
Question on Calling Julia functions from C++ with CxxWrap (and Eventually Interfacing with Julia from MATLAB) General Usage question , embedding , matlab , c	2	714	September 25, 2021
Struggling to get Julia as fast as Matlab mex. (Filling large matrix, doing pairwise calculations over particles) Performance question	7	403	May 5, 2023
How to mitigate bad performance of eigvals from LinearAlgebra New to Julia question , mkl , linearalgebra	3	376	January 1, 2024
Speeding up some non-optimized Julia functions Performance	47	1163	July 19, 2022
Matlab versus Julia General Usage	33	4935	July 15, 2021

Performance of Julia binary wrapper behind Matlab mex-file interface performance

Related topics