MATLAB outperforms Julia (20 times faster) running this nested loop

Hello Fellas,

I am relatively new to Julia programming. So far, Julia's high performance has caught my attention, which made it all the more disappointing to see MATLAB execute the following program 20 times faster. I tested the program on two different workstations and got consistent results.

Context: this program solves Laplace's equation using the (central) finite-difference method in a nested-loop fashion. It uses only elementary programming constructs (no packages are imported!).


n = 100
T = zeros(n, n)
Tₖ₋₁ = zeros(n, n)

# Dirichlet boundary conditions
T[1, 1:n] .= 500
T[n, 1:n] .= 500
T[1:n, 1] .= 200
T[1:n, n] .= 500

error = 1.0
tolerance = 1e-6
@time begin
    while error > tolerance
        Tₖ₋₁ .= T  # keep the previous iterate to measure the update size
        for i = 2:n-1
            for j = 2:n-1
                # in-place update from the four nearest neighbors
                T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            end
        end
        global error = maximum(abs.(Tₖ₋₁ .- T))
    end
end

The MATLAB version is a literal translation of the Julia code and can be downloaded here: MATLAB laplacEq - Google Drive


Results:

Julia (1.8.3) → 35.892180 seconds (901.65 M allocations: 16.409 GiB, 2.00% gc time, 0.62% compilation time)

MATLAB (R2022b) → Elapsed time is 1.578888 seconds.


There are two relevant questions arising from this problem:

  1. Why is Julia outperformed by MATLAB in this case?
  2. What can be done to improve Julia’s performance?

Since this is a fairly simple program with no dependency on additional packages, solving this puzzle might help new Julia users (like me) to develop Julia code with good programming practices, building up a strong foundation from the start.

All the best,
Santiago

1 Like

https://docs.julialang.org/en/v1/manual/performance-tips/#Performance-critical-code-should-be-inside-a-function

7 Likes

Yup, just putting the Julia code in a function (and fixing the loop order: Julia arrays are column-major, so the inner loop should run over the first index) makes it more than 30x faster:

function doit!(T, Tₖ₋₁=copy(T); tolerance=1e-6)
    m, n = size(T)
    error = 1.0
    while error > tolerance
        Tₖ₋₁ .= T
        for j = 2:n-1, i = 2:m-1
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
        end
        error = maximum(abs.(Tₖ₋₁ .- T))
    end
    return T
end

But then you can make it another 2x faster by completely eliminating the Tₖ₋₁ array, plus a couple of other tricks:

function doit2!(T; tolerance=1e-6)
    m, n = size(T)
    while true
        error = zero(eltype(T))
        @inbounds for j = 2:n-1, i = 2:m-1
            Tᵢⱼ = T[i, j]
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            error = max(error, abs(T[i, j] - Tᵢⱼ))
        end
        error ≤ tolerance && break
    end
    return T
end

and I get another factor of 3 by using the LoopVectorization package:

using LoopVectorization
function doit3!(T; tolerance=1e-6)
    m, n = size(T)
    while true
        error = zero(eltype(T))
        @turbo for j = 2:n-1, i = 2:m-1
            Tᵢⱼ = T[i, j]
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            error = max(error, abs(T[i, j] - Tᵢⱼ))
        end
        error ≤ tolerance && break
    end
    return T
end

I would benchmark with something like:

using BenchmarkTools
@btime doit3!($T; tolerance=1e-6) setup=(T[2:n-1,2:n-1] .= 0);

to get accurate numbers. (If you use the more primitive @time, be sure to run the benchmark at least a couple of times, since on the first run you are measuring compilation time.)
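For illustration, here is a rough sketch of the same point with the builtin @time (resetting the interior between runs so each call does comparable work):

T[2:n-1, 2:n-1] .= 0
@time doit3!(T)   # first call: compilation + runtime
T[2:n-1, 2:n-1] .= 0
@time doit3!(T)   # subsequent calls: runtime only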

  • Moral of this story: the first Julia code you write is almost always quite slow, especially if you haven’t yet grokked the performance tips. But you can usually make it much faster using only localized tweaks (as opposed to throwing your code in the trash and rewriting it in Fortran).

(Of course, the code can be made faster yet because Gauss–Seidel is a terrible way to solve Laplace’s equation, but that would be “cheating”.)

38 Likes

What is loop vectorization doing under the hood? Would it be better practice to code in vectorized form instead of using for loops?

That’s a lot of potential globals there. Have you tried throwing all of this into a function? When I do, I get faster times on old hardware.

julia> function santiago()

       n = 100
       W = 20
       H = 20
       δx = W/n
       δy = H/n
       x = 0:δx:W
       y = 0:δy:H

       T = zeros(n, n)
       Tₖ₋₁ = zeros(n, n)
       T[1, 1:n] .= 500
       T[n, 1:n] .= 500
       T[1:n, 1] .= 200
       T[1:n, n] .= 500

       error = 1.0;
       tolerance = 1E-6
       @time begin
           while error > tolerance
               Tₖ₋₁ .= T
               for i = 2:n-1
                   for j = 2:n-1
                       T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
                   end
               end
               error = maximum(abs.(Tₖ₋₁ .- T))
           end
       end
       return T

       end
santiago (generic function with 1 method)

julia> T = santiago();
  1.410719 seconds (26.72 k allocations: 1019.899 MiB, 2.50% gc time)

Possibly tangential, but why isn’t performance in Matlab affected by global variables in the same way as in Julia?

1 Like

I’m taking a wild guess here, but Julia has thread-based concurrency available to all functions, whereas MATLAB only introduced threads recently, and they are limited to a handful of built-in functions.

If only one piece of code at a time can modify a set of globals, you can make a fair number of assumptions that we cannot make in Julia. Annotating a global with a type is one thing that really speeds up the use of globals in Julia.
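For example (a minimal sketch; typed global declarations require Julia ≥ 1.8):

tolerance::Float64 = 1e-6   # typed global: the compiler may assume the type
const n = 100               # const global: binding and type are both fixed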

This is awesome!
Thank you all for following up, especially @stevengj for your timely response. As a new Julia user, I have learned a lot from this discussion.

To summarize, the piece of code below is the fastest so far.

My initial “lazy” Julia code: 35.892180 seconds (901.65 M allocations: 16.409 GiB)
The MATLAB literal translation: 1.578888 seconds
The “turbo” Julia code: 0.072218 seconds (2 allocations: 78.17 KiB)

using LoopVectorization

function doit3!(n; tolerance=1e-6)

    T = zeros(n, n)
    T[1, 1:n] .= 500
    T[n, 1:n] .= 500
    T[1:n, 1] .= 200
    T[1:n, n] .= 500

    while true
        error = zero(eltype(T))
        @turbo for j = 2:n-1, i = 2:n-1
            Tᵢⱼ = T[i, j]
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            error = max(error, abs(T[i, j] - Tᵢⱼ))
        end
        error ≤ tolerance && break
    end
    return T
end

The initial motivation for the post was seeing MATLAB outperform an identical implementation in Julia. Now I understand that, in some cases, taking full advantage of Julia's capabilities requires digging a bit deeper into the language.

Note that:

  1. The appropriateness of the numerical method implemented here is separate from this discussion.
  2. The MATLAB code can also be modified to achieve better performance.

All the best,
Santiago

5 Likes

Do you have the modified MATLAB code that gives the improved performance?

Hi! I’m the author of LoopVectorization.jl.

The “Vectorization” in LoopVectorization is not about “vectorized form”, but about “vector” (i.e. SIMD) CPU instructions. SIMD means “Single Instruction Multiple Data”, i.e. LoopVectorization.jl makes each CPU instruction operate on multiple loop iterations at a time.
Note that even without @turbo, Julia will use SIMD/vector instructions. @turbo just tries hard to do better than the default autovectorizer used by Julia (and by other languages like C++ or Rust).

TLDR: The “Vectorization” in “LoopVectorization” is a low-level implementation detail that in the future won’t be exposed (I’m rewriting it, and the new version will get a new name).
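For a rough illustration (a sketch, not from the package itself), you can see the default autovectorizer at work by inspecting the LLVM IR generated for a plain loop:

# Even without @turbo, LLVM autovectorizes simple loops like this one.
function scale!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = a * x[i]
    end
    return y
end

y = zeros(1024); x = rand(1024)
@code_llvm debuginfo=:none scale!(y, 2.0, x)   # look for vector types like <4 x double>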

29 Likes

(Edit: Oops, missed that the author of LoopVectorization.jl had already answered this question :smile:)

LoopVectorization.jl is concerned with SIMD vectorization, which is not the same as “vectorized style”.

You can use LoopVectorization.jl on either loops or dot-broadcasted expressions, and, as you can see, the given example was a loop.
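A minimal sketch of both forms (the broadcast form is accepted by @turbo as well, per the package docs):

using LoopVectorization

x = rand(1000); y = similar(x)

@turbo for i in eachindex(x)   # loop form
    y[i] = 2 * x[i] + 1
end

@turbo @. y = 2 * x + 1        # equivalent dot-broadcast form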

4 Likes

Ironically, the vectorized style is actually harder to SIMD-vectorize because of the semantics of broadcasting.
That’s why (for example) FastBroadcast.jl defaults to not actually broadcasting; you need to pass an argument to enable that when using FastBroadcast.@...
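If I recall FastBroadcast.jl's README correctly (treat this as a sketch rather than gospel), the macro is @.. and true broadcasting is opt-in via a keyword:

using FastBroadcast

x = rand(100); y = rand(100); out = similar(x)

# Default: assumes all array arguments share the same axes (no true
# broadcasting), which permits simpler, SIMD-friendly loops.
@.. out = x + 2 * y

# When shapes genuinely differ, enable broadcasting explicitly:
m = rand(100, 5); out2 = similar(m)
@.. broadcast=true out2 = m + x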

4 Likes

There’s no multithreading in this Julia code.

MATLAB simply does not appear to be affected by type instabilities. Traditionally, this was because it was interpreted. Today it is JIT-compiled, but the compiler is closed source and little is known about how it works; I believe it is ‘classical’ JIT compilation, which possibly means it happens at a later stage and may have access to more information than Julia’s ‘(just) ahead-of-time’ compiler.

It doesn’t matter that there is no multithreading here. The semantics of Julia mean that we cannot make certain assumptions about globals without a type assertion, in part due to multithreading. MATLAB can make assumptions about globals that Julia cannot, in part because multithreading is not generally part of the language: in MATLAB, you can assume that if the type of a global changes, it is because the block of code you are executing changed it.

Do you have evidence that MATLAB’s JIT is not affected by type instability? It seems that MATLAB’s performance with globals may be degrading.

3 Likes

But the performance issue of non-const globals predates multithreading in Julia, afaik. And I’ve not heard that as the main explanation, anyway.

I said ‘appears’. And besides, that’s what the OP is about: MATLAB not suffering from the type instability, so the OP itself is evidence.

Anyway, I’ll add that your explanation does in fact make sense.

Isn’t everything a matrix in Matlab?

I followed that link now, and I don’t think it supports the hypothesis that there are increasing performance issues with globals in Matlab. The person asking the question thinks so, but the other posters (including staff) could not reproduce.

They also have a class, equivalent to “element type”.

>> a = 1

a =

     1

>> class(a)

ans =

    'double'

>> size(a)

ans =

     1     1

>> b = uint16(2)

b =

  uint16

   2

>> whos
  Name      Size            Bytes  Class     Attributes

  a         1x1                 8  double              
  ans       1x2                16  double              
  b         1x1                 2  uint16             

If we put LoopVectorization aside for a moment, this is what I get when I apply the same change to both codes (using the column-major loop order that MATLAB and Julia share):

function [ mT ] = SolveLaplace( mT, numItr )

[numRows, numCols] = size(mT);

for kk = 1:numItr
    for jj = 2:(numCols - 1)
        for ii = 2:(numRows - 1)
            mT(ii, jj) = 0.25 * (mT(ii - 1, jj) + mT(ii + 1, jj) + mT(ii, jj - 1) + mT(ii, jj + 1));
        end
    end
end

end

For Julia:

function SolveLaplace( mT, numItr )
    numRows, numCols = size(mT);

    for kk in 1:numItr
        for jj in 2:(numCols - 1)
            for ii in 2:(numRows - 1)
                mT[ii, jj] = 0.25 * (mT[ii - 1, jj] + mT[ii + 1, jj] + mT[ii, jj - 1] + mT[ii, jj + 1]);
            end
        end
    end
    return mT;
end

Running this in MATLAB:

clear();

n = 100;
T = zeros(n);
T(1, 1:n) = 200;
T(n, 1:n) = 500;
T(1:n, 1) = 500;
T(1:n, n) = 500;

numItr = 1000;

hF = @() SolveLaplace(T, numItr);

TimeItMin(hF)

Will yield: ans = 0.0422.

In Julia @btime SolveLaplace($T, $numItr); will yield 41.179 ms (0 allocations: 0 bytes).

So both probably yield equivalent machine code for these simple loops.
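One way to sanity-check that on the Julia side (a suggestion only; I have not compared it against MATLAB's generated code) is to inspect the native code emitted for the hot loop:

@code_native debuginfo=:none SolveLaplace(T, 1)   # T and SolveLaplace from the benchmark above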

Remark
The function TimeItMin() is the same as MATLAB’s timeit() with a single modification: it uses the minimum run time instead of the median, to match Julia’s @btime.

9 Likes

The ! at the end of the function name is a Julia convention for functions that modify (“mutate”) an argument in place. I used it in my code because I was passing T as an argument and modifying it inside the function. If you rewrite the function to take n as an argument and allocate T internally, then you shouldn’t use !. Again, this is purely a convention: you can name your functions whatever you want, but if you’re going to use Julia seriously you’ll want to understand what ! means.
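A standard-library illustration of the convention:

v = [3, 1, 2]
w = sort(v)   # no `!`: returns a new sorted array, v is untouched
sort!(v)      # `!`: sorts v in place; now v == [1, 2, 3]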

Conceptually, I think it is cleaner to separate the code that allocates T and initializes it to a particular set of boundary conditions and initial values from the code that solves Laplace’s equation.
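Something along these lines, for instance (a sketch; the init_boundary! helper is hypothetical, not from any code above):

# Hypothetical helper: allocate-and-initialize is kept separate from solving.
function init_boundary!(T, top, bottom, left, right)
    T[1, :]   .= top
    T[end, :] .= bottom
    T[:, 1]   .= left
    T[:, end] .= right
    return T
end

T = init_boundary!(zeros(100, 100), 500, 500, 200, 500)
doit3!(T)   # the solver from earlier in the thread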

For benchmarking, neither your original MATLAB code nor your original Julia code includes the initialization of T in the timing. Of course, you can put a @time call inside the function, but from a coding standpoint it is usually better to separate benchmarking from implementation. (Especially if you want to use a package like BenchmarkTools.jl to get more reliable timing statistics, or to do more fine-grained performance profiling.)

5 Likes