Performance of Julia binary wrapper behind Matlab mex-file interface performance

andreasvarga · March 14, 2025, 10:51am

Recently I implemented a julia wrapper for a subroutine of the SLICOT library (specifically SB03MD to solve Lyapunov matrix equations). I used the generated binary wrappers in conjunction with Intel’s MKL, which is used also in MATLAB. I tested just the time required by the ccall:


    # SUBROUTINE SB03MD( DICO, JOB, FACT, TRANA, 
    #                    N, A, LDA, U, LDU, C, LDC, 
    #                    SCALE, SEP, FERR, WR, WI, IWORK, DWORK,
    #                    LDWORK, INFO )
    @time ccall((:sb03md_, libslicot), Cvoid, (Ref{UInt8}, Ref{UInt8}, Ref{UInt8}, Ref{UInt8},  
            Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, 
            Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, 
            Ptr{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ref{BlasInt}, Clong, Clong, Clong, Clong), 
            dico, job, fact, trana, n, A, lda, U, ldu, C, ldc,  
            scale, sep, ferr, wr, wi, 
            iwork, dwork, ldwork, info, 1, 1, 1, 1)

to eliminate the times required by all previous allocations.

For comparison, I used the function sllyap.m, a mex-file based implementation of the same function available in the SLICOT-BasicControl toolbox.

For a Lyapunov equation with randomly generated matrices, in MATLAB, I obtained:

>> n = 1000; a = rand(n,n); u = rand(n,n); c = rand(n,n); c = c'*c;
>> tic; x1 = sllyap(a',-c); toc
Elapsed time is 0.548520 seconds.

while for the Julia wrapper I got about 1.485512 seconds (several times, trust me!). This is almost 3 times slower than the MATLAB interface!

I wonder, what is the cause of this difference. Any opinion is most welcome.

The MATLAB tests have been performed with MATLAB R2024b and the Julia tests with Julia Version 1.11.2, both on Windows.

Note: The MATLAB official function lyap is even faster, requiring 0.439914 seconds, but apparently another LAPACK-based Sylvester equation solver is employed (although it is claimed that SB03MD is used!)

baggepinnen · March 14, 2025, 11:32am

Is matlab using threading while your call from julia does not?

andreasvarga · March 14, 2025, 11:43am

Frankly, I don’t know. The used mex-file can be inspected at https://github.com/SLICOT/SLICOT-BasicControl/blob/master/mex_src/linmeq.f.

However, with 4 theads I got the same timing result.

baggepinnen · March 14, 2025, 11:52am

You can inspect the CPU usage on your machine while you are running a benchmark, e.g. using “system monitor” on linux or “Task manager” on windows.

My CPU is currently mostly idle

but if I run LinearAlgebra.peakflops(2^14)

you see it starts using several CPU cores

andreasvarga · March 14, 2025, 12:05pm

The CPU peaks are 1.5 times higher when executing lyap of MATLAB compared to Julia. I have 8 cores on my PC.

stevengj · March 14, 2025, 1:40pm

How did you set the threads?

In Matlab, by default it uses all the cores on your machine for its BLAS. You can start it with -singleCompThread, or use maxNumCompThreads(1), to force it to run single-threaded.

In Julia, do LinearAlgebra.BLAS.set_num_threads(1) (this is separate from the number of native Julia threads).

Usually, you want to benchmark everything single-threaded first, before you try to understand the multi-core performance.

andreasvarga · March 14, 2025, 1:58pm

With
maxNumCompThreads(1)
I got a certain slowdown to 0.795377 seconds.


julia> Threads.nthreads()
8
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> n = 1000; a = rand(n,n); c = rand(n,n); c = c'*c;

julia> x = sllyapc(a,c);  # only the ccall time is displayed by the equivalent function in Julia
  1.515265 seconds
julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> x = sllyapc(a,c);
  1.652563 seconds

A small slowdown for single-threaded run.

andreasvarga · March 14, 2025, 2:13pm

I wonder if the compilation options used when generating the binary wrappers play a role here.

stevengj · March 14, 2025, 3:24pm

What about Julia’s lyap function?

You might also just benchmark something very simple like a 2000x2000 matrix multiplication. If that is not nearly identical between Matlab and Julia, then you are doing something different in terms of the BLAS library.

andreasvarga · March 14, 2025, 6:43pm

This is the result with Julia’s lyap running with 8 threads

julia> @btime x = lyap($a,$c);
  2.908 s (34 allocations: 46.07 MiB)

which is thus 5.3 times slower than Matlab’s lyap (provided in the Control Systems Toolbox).

The equivalent Julia function lyapc available in the MatrixEquations performs quite well

julia> @btime x = lyapc($a,$c);
  1.170 s (48 allocations: 46.09 MiB)

and the SLICOT-wrapper based implementation produces

julia> @btime x = sllyapc($a,$c);
  1.425 s (31 allocations: 45.91 MiB)

Recall the MATLAB results:

>> n = 1000; a = rand(n,n); u = rand(n,n); c = rand(n,n); c = c'*c;
>> tic; x = lyap(a,c); toc
Elapsed time is 0.548693 seconds.

and the mex-file based solver based on SLICOT produces

>> tic; x = sllyap(a',-c); toc
Elapsed time is 0.613302 seconds.

Regarding matrix multiplications, we have in Julia

julia> n = 2000; a = rand(n,n); b = rand(n,n);
julia> @btime c = a*b;
  73.168 ms (3 allocations: 30.52 MiB)

and in MATLAB

>> n = 2000; a = rand(n); b = rand(n);
>> tic; c = a*b; toc
Elapsed time is 0.095879 seconds.

so Julia is somewhat faster.

Just a comment on Julia’s and Matlab’s lyap functions: Both solvers are based on the LAPACK’s Sylvester equation solver, but the Matlab solver performs the final symmetrization of the solution if symmetric solution is expected, while Julia’s lyap omits this. The lyapc solver in MatrixEquations exploits all structural features of the problem and is fully implemented in Julia. This partly explains the better performance than the standard lyap function.

stevengj · March 14, 2025, 9:12pm

This is what makes this mysterious.

Or maybe Matlab’s lyap function is taking advantage of your c matrix being Hermitian (up to roundoff errors) and dispatching to a different algorithm?

stevengj · March 14, 2025, 9:36pm

Actually, it looks they are not both calling the same LAPACK routine here. Julia is using dtrsyl from LAPACK, but according to the Matlab lyap docs it “uses SLICOT routines SB03MD and SG03AD for Lyapunov equations”.

andreasvarga · March 15, 2025, 10:10am

I think this information is false. The call after the reduction to Schur form is

% Solve Lyapunov Equation TA*X + X*TA' = QA'*C*QA.
   X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp');

which calls an internal function. Behind this I suppose is dtrsyl from LAPACK. I checked with the MATLAB debugger that the resulting X is not exactly symmetric. SB03MD provides always a symmetric solution and, actually, exploits this property by determining only the elements in the upper triangle.

I measured the time spent only in the calls to LAPACK dtrsyl. For MATLAB I got

K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.

For Julia

n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;
julia> @time Y, scale = LAPACK.trsyl!('N', 'T', a, a, c);
  2.556617 seconds (7 allocations: 192 bytes)

which is about 50 times longer! Using more cores in Julia is not a big help

julia> LinearAlgebra.BLAS.set_num_threads(8)

julia> @time Y, scale = LAPACK.trsyl!('N', 'T', a, a, c);
  2.521782 seconds (7 allocations: 192 bytes)

Benny · March 15, 2025, 1:01pm

Just checking, how exactly are you doing this?

That’s unfortunate. Downside of MATLAB being closed-source is relying on accurate documentation of builtins and how much more difficult it is to spot errors to report. Reproducing MATLAB’s good things is often unfeasible, which is part of the point I suppose. You know more about SLICOT-BasicControl’s inner workings, at least.

andreasvarga · March 15, 2025, 1:18pm

After starting Julia, the first command is
using MKL
and then loading other stuff.

andreasvarga · March 15, 2025, 1:46pm

The time to execute the ccall in the Julia wrapper trsyl! (see below)

            @time ccall((:dtrsyl_, libblastrampoline), Cvoid,
                (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
                 Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
                 Ptr{Float64}, Ref{BlasInt}, Clong, Clong),
                transa, transb, isgn, m, n,
                A, lda, B, ldb, C, ldc,
                scale, info, 1, 1)

is 2.534428 seconds (to be compared with the equivalent MATLAB call with
0.047330 seconds). This large gap is not anymore related to SLICOT, but is an interfacing problem of Julia with LAPACK. I wonder if I should open an issue for this or wait for opinions of somebody knowledgeable with BLAS/LAPACK interfacing aspects.

Benny · March 15, 2025, 3:35pm

I would expect you to be using the BLAS/LAPACK from the updated MKL then. MATLAB may be doing something else as far as we know, but I doubt they’re making some proprietary software that surpasses MKL on an Intel CPU (which I assume you’re implying).

Something else I’m wondering, what system are you on (versioninfo)? Not sure if I’m misremembering something, but shouldn’t there be 64 in the BLAS/LAPACK names for 64-bit systems? That’s what the @blasfunc macro call is adding to the dtrsyl_ name in lapack.jl for me, anyway.

Doesn’t hurt to open an issue, at least another place to wait for opinions.

andreasvarga · March 15, 2025, 4:46pm

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Also

julia> using MKL

julia> using LinearAlgebra

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] mkl_rt.2.dll
└ [ LP64] mkl_rt.2.dll

andreasvarga · March 16, 2025, 11:32am

One possibility for higher speed in MATLAB would be, if the internal function used for test

K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.

calls the new Level 3 LAPACK solver dtrsyl3 instead of dtrsyl, which is used in Julia’s lyap. Still, the huge gap (50 times speed difference) must have another explanation.

Topic		Replies	Views
Question on Calling Julia functions from C++ with CxxWrap (and Eventually Interfacing with Julia from MATLAB) General Usage question , embedding , matlab , c	2	714	September 25, 2021
Struggling to get Julia as fast as Matlab mex. (Filling large matrix, doing pairwise calculations over particles) Performance question	7	406	May 5, 2023
How to mitigate bad performance of eigvals from LinearAlgebra New to Julia question , mkl , linearalgebra	3	376	January 1, 2024
Speeding up some non-optimized Julia functions Performance	47	1170	July 19, 2022
Matlab versus Julia General Usage	33	4938	July 15, 2021

Performance of Julia binary wrapper behind Matlab mex-file interface performance

Related topics