Performance of Julia binary wrapper behind Matlab mex-file interface performance

Recently I implemented a julia wrapper for a subroutine of the SLICOT library (specifically SB03MD to solve Lyapunov matrix equations). I used the generated binary wrappers in conjunction with Intel’s MKL, which is used also in MATLAB. I tested just the time required by the ccall:


    # SUBROUTINE SB03MD( DICO, JOB, FACT, TRANA, 
    #                    N, A, LDA, U, LDU, C, LDC, 
    #                    SCALE, SEP, FERR, WR, WI, IWORK, DWORK,
    #                    LDWORK, INFO )
    @time ccall((:sb03md_, libslicot), Cvoid, (Ref{UInt8}, Ref{UInt8}, Ref{UInt8}, Ref{UInt8},  
            Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, 
            Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, 
            Ptr{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ref{BlasInt}, Clong, Clong, Clong, Clong), 
            dico, job, fact, trana, n, A, lda, U, ldu, C, ldc,  
            scale, sep, ferr, wr, wi, 
            iwork, dwork, ldwork, info, 1, 1, 1, 1)

to eliminate the times required by all previous allocations.

For comparison, I used the function sllyap.m, a mex-file based implementation of the same function available in the SLICOT-BasicControl toolbox.

For a Lyapunov equation with randomly generated matrices, in MATLAB, I obtained:

>> n = 1000; a = rand(n,n); u = rand(n,n); c = rand(n,n); c = c'*c;
>> tic; x1 = sllyap(a',-c); toc
Elapsed time is 0.548520 seconds.

while for the Julia wrapper I got about 1.485512 seconds (several times, trust me!). This is almost 3 times slower than the MATLAB interface!

I wonder, what is the cause of this difference. Any opinion is most welcome.

The MATLAB tests have been performed with MATLAB R2024b and the Julia tests with Julia Version 1.11.2, both on Windows.

Note: The MATLAB official function lyap is even faster, requiring 0.439914 seconds, but apparently another LAPACK-based Sylvester equation solver is employed (although it is claimed that SB03MD is used!)

Is matlab using threading while your call from julia does not?

Frankly, I don’t know. The used mex-file can be inspected at https://github.com/SLICOT/SLICOT-BasicControl/blob/master/mex_src/linmeq.f.

However, with 4 theads I got the same timing result.

You can inspect the CPU usage on your machine while you are running a benchmark, e.g. using “system monitor” on linux or “Task manager” on windows.

My CPU is currently mostly idle

but if I run LinearAlgebra.peakflops(2^14)

you see it starts using several CPU cores

The CPU peaks are 1.5 times higher when executing lyap of MATLAB compared to Julia. I have 8 cores on my PC.

How did you set the threads?

In Matlab, by default it uses all the cores on your machine for its BLAS. You can start it with -singleCompThread, or use maxNumCompThreads(1), to force it to run single-threaded.

In Julia, do LinearAlgebra.BLAS.set_num_threads(1) (this is separate from the number of native Julia threads).

Usually, you want to benchmark everything single-threaded first, before you try to understand the multi-core performance.

2 Likes

With
maxNumCompThreads(1)
I got a certain slowdown to 0.795377 seconds.


julia> Threads.nthreads()
8
julia> LinearAlgebra.BLAS.set_num_threads(8)
julia> n = 1000; a = rand(n,n); c = rand(n,n); c = c'*c;

julia> x = sllyapc(a,c);  # only the ccall time is displayed by the equivalent function in Julia
  1.515265 seconds
julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> x = sllyapc(a,c);
  1.652563 seconds

A small slowdown for single-threaded run.

I wonder if the compilation options used when generating the binary wrappers play a role here.

What about Julia’s lyap function?

You might also just benchmark something very simple like a 2000x2000 matrix multiplication. If that is not nearly identical between Matlab and Julia, then you are doing something different in terms of the BLAS library.

This is the result with Julia’s lyap running with 8 threads

julia> @btime x = lyap($a,$c);
  2.908 s (34 allocations: 46.07 MiB)

which is thus 5.3 times slower than Matlab’s lyap (provided in the Control Systems Toolbox).

The equivalent Julia function lyapc available in the MatrixEquations performs quite well

julia> @btime x = lyapc($a,$c);
  1.170 s (48 allocations: 46.09 MiB)

and the SLICOT-wrapper based implementation produces

julia> @btime x = sllyapc($a,$c);
  1.425 s (31 allocations: 45.91 MiB)

Recall the MATLAB results:

>> n = 1000; a = rand(n,n); u = rand(n,n); c = rand(n,n); c = c'*c;
>> tic; x = lyap(a,c); toc
Elapsed time is 0.548693 seconds.

and the mex-file based solver based on SLICOT produces

>> tic; x = sllyap(a',-c); toc
Elapsed time is 0.613302 seconds.

Regarding matrix multiplications, we have in Julia

julia> n = 2000; a = rand(n,n); b = rand(n,n);
julia> @btime c = a*b;
  73.168 ms (3 allocations: 30.52 MiB)

and in MATLAB

>> n = 2000; a = rand(n); b = rand(n);
>> tic; c = a*b; toc
Elapsed time is 0.095879 seconds.

so Julia is somewhat faster.

Just a comment on Julia’s and Matlab’s lyap functions: Both solvers are based on the LAPACK’s Sylvester equation solver, but the Matlab solver performs the final symmetrization of the solution if symmetric solution is expected, while Julia’s lyap omits this. The lyapc solver in MatrixEquations exploits all structural features of the problem and is fully implemented in Julia. This partly explains the better performance than the standard lyap function.

1 Like

This is what makes this mysterious.

Or maybe Matlab’s lyap function is taking advantage of your c matrix being Hermitian (up to roundoff errors) and dispatching to a different algorithm?

Actually, it looks they are not both calling the same LAPACK routine here. Julia is using dtrsyl from LAPACK, but according to the Matlab lyap docs it “uses SLICOT routines SB03MD and SG03AD for Lyapunov equations”.

I think this information is false. The call after the reduction to Schur form is

% Solve Lyapunov Equation TA*X + X*TA' = QA'*C*QA.
   X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp');

which calls an internal function. Behind this I suppose is dtrsyl from LAPACK. I checked with the MATLAB debugger that the resulting X is not exactly symmetric. SB03MD provides always a symmetric solution and, actually, exploits this property by determining only the elements in the upper triangle.

I measured the time spent only in the calls to LAPACK dtrsyl. For MATLAB I got

K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.

For Julia

n = 1000; a, = schur(rand(n,n)); c = rand(n,n); c = c'*c;
julia> @time Y, scale = LAPACK.trsyl!('N', 'T', a, a, c);
  2.556617 seconds (7 allocations: 192 bytes)

which is about 50 times longer! Using more cores in Julia is not a big help

julia> LinearAlgebra.BLAS.set_num_threads(8)

julia> @time Y, scale = LAPACK.trsyl!('N', 'T', a, a, c);
  2.521782 seconds (7 allocations: 192 bytes)

Just checking, how exactly are you doing this?

That’s unfortunate. Downside of MATLAB being closed-source is relying on accurate documentation of builtins and how much more difficult it is to spot errors to report. Reproducing MATLAB’s good things is often unfeasible, which is part of the point I suppose. You know more about SLICOT-BasicControl’s inner workings, at least.

After starting Julia, the first command is
using MKL
and then loading other stuff.

The time to execute the ccall in the Julia wrapper trsyl! (see below)

            @time ccall((:dtrsyl_, libblastrampoline), Cvoid,
                (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
                 Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
                 Ptr{Float64}, Ref{BlasInt}, Clong, Clong),
                transa, transb, isgn, m, n,
                A, lda, B, ldb, C, ldc,
                scale, info, 1, 1)

is 2.534428 seconds (to be compared with the equivalent MATLAB call with
0.047330 seconds). This large gap is not anymore related to SLICOT, but is an interfacing problem of Julia with LAPACK. I wonder if I should open an issue for this or wait for opinions of somebody knowledgeable with BLAS/LAPACK interfacing aspects.

I would expect you to be using the BLAS/LAPACK from the updated MKL then. MATLAB may be doing something else as far as we know, but I doubt they’re making some proprietary software that surpasses MKL on an Intel CPU (which I assume you’re implying).

Something else I’m wondering, what system are you on (versioninfo)? Not sure if I’m misremembering something, but shouldn’t there be 64 in the BLAS/LAPACK names for 64-bit systems? That’s what the @blasfunc macro call is adding to the dtrsyl_ name in lapack.jl for me, anyway.

Doesn’t hurt to open an issue, at least another place to wait for opinions.

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Also

julia> using MKL

julia> using LinearAlgebra

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] mkl_rt.2.dll
└ [ LP64] mkl_rt.2.dll

One possibility for higher speed in MATLAB would be, if the internal function used for test

K>> tic; X = matlab.internal.math.sylvester_tri(TA, TA, CC, 'I', 'I', 'transp'); toc
Elapsed time is 0.047330 seconds.

calls the new Level 3 LAPACK solver dtrsyl3 instead of dtrsyl, which is used in Julia’s lyap. Still, the huge gap (50 times speed difference) must have another explanation.

See also here for comparisons.

Did you try to profile Julia and Matlab with a profiler (e.g. VTune if you’re on x86_64) which can show you C functions? That should bring some insights, instead of doing guesswork.

1 Like