MATLAB outperforms Julia (20 times faster) running this nested loop

Hello Fellas,

I am relatively new to Julia programming. So far, Julia's high performance has caught my attention, which made it all the more disappointing to see MATLAB execute the following program 20 times faster. I tested the program on two different workstations and got consistent results.

Context: this program solves Laplace's equation using the (central) finite-difference method in a nested-loop fashion. It uses only elementary programming constructs (no packages are imported!).


n = 100
T = zeros(n, n)
Tₖ₋₁ = zeros(n, n)

# Dirichlet boundary conditions
T[1, 1:n] .= 500
T[n, 1:n] .= 500
T[1:n, 1] .= 200
T[1:n, n] .= 500

error = 1.0
tolerance = 1e-6
@time begin
    while error > tolerance
        Tₖ₋₁ .= T  # keep the previous iterate to measure the update size
        for i = 2:n-1
            for j = 2:n-1
                # in-place update from the four nearest neighbors
                T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            end
        end
        global error = maximum(abs.(Tₖ₋₁ .- T))
    end
end

The MATLAB version is a literal translation of the Julia code and can be downloaded here: MATLAB laplacEq - Google Drive


Results:

Julia (1.8.3) → 35.892180 seconds (901.65 M allocations: 16.409 GiB, 2.00% gc time, 0.62% compilation time)

MATLAB (R2022b) → Elapsed time is 1.578888 seconds.


There are two relevant questions arising from this problem:

  1. Why is Julia outperformed by MATLAB in this case?
  2. What can be done to improve Julia’s performance?

Since this is a fairly simple program with no dependency on additional packages, solving this puzzle might help new Julia users (like me) to develop Julia code with good programming practices, building up a strong foundation from the start.

All the best,
Santiago

1 Like

https://docs.julialang.org/en/v1/manual/performance-tips/#Performance-critical-code-should-be-inside-a-function

7 Likes

Yup, just putting the Julia code in a function (and fixing the loop order: Julia arrays are column-major, so the inner loop should run over the first index) makes it more than 30x faster:

function doit!(T, Tₖ₋₁=copy(T); tolerance=1e-6)
    m, n = size(T)
    error = 1.0
    while error > tolerance
        Tₖ₋₁ .= T
        for j = 2:n-1, i = 2:m-1
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
        end
        error = maximum(abs.(Tₖ₋₁ .- T))
    end
    return T
end

But then you can make it another 2x faster by completely eliminating the Tₖ₋₁ array, plus a couple of other tricks:

function doit2!(T; tolerance=1e-6)
    m, n = size(T)
    while true
        error = zero(eltype(T))
        @inbounds for j = 2:n-1, i = 2:m-1
            Tᵢⱼ = T[i, j]
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            error = max(error, abs(T[i, j] - Tᵢⱼ))
        end
        error ≤ tolerance && break
    end
    return T
end

and I get another factor of 3 by using the LoopVectorization package:

using LoopVectorization
function doit3!(T; tolerance=1e-6)
    m, n = size(T)
    while true
        error = zero(eltype(T))
        @turbo for j = 2:n-1, i = 2:m-1
            Tᵢⱼ = T[i, j]
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            error = max(error, abs(T[i, j] - Tᵢⱼ))
        end
        error ≤ tolerance && break
    end
    return T
end

I would benchmark with something like:

using BenchmarkTools
@btime doit3!($T; tolerance=1e-6) setup=(T[2:n-1,2:n-1] .= 0);

to get accurate numbers. (If you use the more primitive @time, be sure to run the benchmark at least a couple of times, since on the first run you are measuring compilation time.)
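For illustration, here is a rough sketch of the same point with the builtin @time (resetting the interior between runs so each call does comparable work):

T[2:n-1, 2:n-1] .= 0
@time doit3!(T)   # first call: compilation + runtime
T[2:n-1, 2:n-1] .= 0
@time doit3!(T)   # subsequent calls: runtime only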

  • Moral of this story: the first Julia code you write is almost always quite slow, especially if you haven’t yet grokked the performance tips. But you can usually make it much faster using only localized tweaks (as opposed to throwing your code in the trash and rewriting it in Fortran).

(Of course, the code can be made faster yet because Gauss–Seidel is a terrible way to solve Laplace’s equation, but that would be “cheating”.)

38 Likes

What is loop vectorization doing under the hood? Would it be better practice to code in vectorized form instead of using for loops?

That’s a lot of potential globals there. Have you tried throwing all of this into a function? When I do, I get faster times on old hardware.

julia> function santiago()

       n = 100
       W = 20
       H = 20
       δx = W/n
       δy = H/n
       x = 0:δx:W
       y = 0:δy:H

       T = zeros(n, n)
       Tₖ₋₁ = zeros(n, n)
       T[1, 1:n] .= 500
       T[n, 1:n] .= 500
       T[1:n, 1] .= 200
       T[1:n, n] .= 500

       error = 1.0;
       tolerance = 1E-6
       @time begin
           while error > tolerance
               Tₖ₋₁ .= T
               for i = 2:n-1
                   for j = 2:n-1
                       T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
                   end
               end
               error = maximum(abs.(Tₖ₋₁ .- T))
           end
       end
       return T

       end
santiago (generic function with 1 method)

julia> T = santiago();
  1.410719 seconds (26.72 k allocations: 1019.899 MiB, 2.50% gc time)

Possibly tangential, but why isn’t performance in Matlab affected by global variables in the same way as in Julia?

1 Like

I’m taking a wild guess here, but Julia has thread-based concurrency available to all functions, whereas MATLAB only introduced threads recently, and they are limited to a handful of built-in functions.

If only one piece of code at a time can modify a set of globals, you can make a fair number of assumptions that we cannot make in Julia. Annotating a global with a type is one thing that really speeds up the use of globals in Julia.
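For example (a minimal sketch; typed global declarations require Julia ≥ 1.8):

tolerance::Float64 = 1e-6   # typed global: the compiler may assume the type
const n = 100               # const global: binding and type are both fixed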

This is awesome!
Thank you all for following up, especially @stevengj for your timely response. As a new Julia user, I have learned a lot from this discussion.

To summarize, the piece of code below is the fastest so far.

My initial “lazy” Julia code: 35.892180 seconds (901.65 M allocations: 16.409 GiB)
The MATLAB literal translation: 1.578888 seconds
The “turbo” Julia code: 0.072218 seconds (2 allocations: 78.17 KiB)

using LoopVectorization

function doit3!(n; tolerance=1e-6)

    T = zeros(n, n)
    T[1, 1:n] .= 500
    T[n, 1:n] .= 500
    T[1:n, 1] .= 200
    T[1:n, n] .= 500

    while true
        error = zero(eltype(T))
        @turbo for j = 2:n-1, i = 2:n-1
            Tᵢⱼ = T[i, j]
            T[i, j] = 1/4*(T[i,j+1] + T[i,j-1] + T[i+1,j] + T[i-1,j])
            error = max(error, abs(T[i, j] - Tᵢⱼ))
        end
        error ≤ tolerance && break
    end
    return T
end

The initial motivation for the post was seeing MATLAB outperform an identical implementation in Julia. Now I understand that, in some cases, taking full advantage of Julia's capabilities requires digging a bit deeper into the language.

Note that:

  1. The appropriateness of the numerical method implemented here is separate from this discussion.
  2. The MATLAB code can also be modified to achieve better performance.

All the best,
Santiago

5 Likes

Do you have the modified MATLAB code that gives the improved performance?

Hi! I’m the author of LoopVectorization.jl.

The “Vectorization” in LoopVectorization is not about “vectorized form”, but about “vector” (i.e. SIMD) CPU instructions. SIMD means “Single Instruction Multiple Data”, i.e. LoopVectorization.jl makes each CPU instruction operate on multiple loop iterations at a time.
Note that even without @turbo, Julia will use SIMD/vector instructions. @turbo just tries hard to do better than the default autovectorizer used by Julia (and by other languages like C++ or Rust).

TLDR: The “Vectorization” in “LoopVectorization” is a low-level implementation detail that in the future won’t be exposed (I’m rewriting it, and the new version will get a new name).
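For a rough illustration (a sketch, not from the package itself), you can see the default autovectorizer at work by inspecting the LLVM IR generated for a plain loop:

# Even without @turbo, LLVM autovectorizes simple loops like this one.
function scale!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = a * x[i]
    end
    return y
end

y = zeros(1024); x = rand(1024)
@code_llvm debuginfo=:none scale!(y, 2.0, x)   # look for vector types like <4 x double>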

29 Likes

(Edit: Oops, missed that the author of LoopVectorization.jl had already answered this question :smile:)

LoopVectorization.jl is concerned with SIMD vectorization, which is not the same as “vectorized style”.

You can use LoopVectorization.jl on either loops or dot-broadcasted expressions, and, as you can see, the given example was a loop.
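A minimal sketch of both forms (the broadcast form is accepted by @turbo as well, per the package docs):

using LoopVectorization

x = rand(1000); y = similar(x)

@turbo for i in eachindex(x)   # loop form
    y[i] = 2 * x[i] + 1
end

@turbo @. y = 2 * x + 1        # equivalent dot-broadcast form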

4 Likes

Ironically, the vectorized style is actually harder to SIMD-vectorize because of the semantics of broadcasting.
That’s why (for example) FastBroadcast.jl defaults to not actually broadcasting; you need to pass an argument to enable that when using FastBroadcast.@...
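If I recall FastBroadcast.jl's README correctly (treat this as a sketch rather than gospel), the macro is @.. and true broadcasting is opt-in via a keyword:

using FastBroadcast

x = rand(100); y = rand(100); out = similar(x)

# Default: assumes all array arguments share the same axes (no true
# broadcasting), which permits simpler, SIMD-friendly loops.
@.. out = x + 2 * y

# When shapes genuinely differ, enable broadcasting explicitly:
m = rand(100, 5); out2 = similar(m)
@.. broadcast=true out2 = m + x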

4 Likes

There’s no multithreading in this Julia code.

MATLAB simply does not appear to be affected by type instabilities. Traditionally, this was because it was interpreted. Today it is JIT-compiled, but the compiler is closed source and little is known about how it works; I believe it is ‘classical’ JIT compilation, which possibly means it happens at a later stage and may have access to more information than Julia’s ‘(just) ahead-of-time’ compiler.

It doesn’t matter that there is no multithreading here. The semantics of Julia mean that we cannot make certain assumptions about globals without a type assertion, in part due to multithreading. MATLAB can make assumptions about globals that Julia cannot, in part because multithreading is not generally part of the language: in MATLAB, you can assume that if the type of a global changes, it is because the block of code you are executing changed it.

Do you have evidence that MATLAB’s JIT is not affected by type instability? It seems that MATLAB’s performance with globals may be degrading.

3 Likes

But the performance issue of non-const globals predates multithreading in Julia, afaik. And I’ve not heard that as the main explanation, anyway.

I said ‘appears’. And besides, that’s what the OP is about: MATLAB not suffering from the type instability, so the OP itself is evidence.

Anyway, I’ll add that your explanation does in fact make sense.

Isn’t everything a matrix in Matlab?

I followed that link now, and I don’t think it supports the hypothesis that there are increasing performance issues with globals in Matlab. The person asking the question thinks so, but the other posters (including staff) could not reproduce.

They also have a class, equivalent to “element type”.

>> a = 1

a =

     1

>> class(a)

ans =

    'double'

>> size(a)

ans =

     1     1

>> b = uint16(2)

b =

  uint16

   2

>> whos
  Name      Size            Bytes  Class     Attributes

  a         1x1                 8  double              
  ans       1x2                16  double              
  b         1x1                 2  uint16             

If we put LoopVectorization aside for a moment, this is what I get when I apply the same change to both codes (using the column-major loop order that MATLAB and Julia share):

function [ mT ] = SolveLaplace( mT, numItr )

[numRows, numCols] = size(mT);

for kk = 1:numItr
    for jj = 2:(numCols - 1)
        for ii = 2:(numRows - 1)
            mT(ii, jj) = 0.25 * (mT(ii - 1, jj) + mT(ii + 1, jj) + mT(ii, jj - 1) + mT(ii, jj + 1));
        end
    end
end

end

For Julia:

function SolveLaplace( mT, numItr )
    numRows, numCols = size(mT);

    for kk in 1:numItr
        for jj in 2:(numCols - 1)
            for ii in 2:(numRows - 1)
                mT[ii, jj] = 0.25 * (mT[ii - 1, jj] + mT[ii + 1, jj] + mT[ii, jj - 1] + mT[ii, jj + 1]);
            end
        end
    end
    return mT;
end

Running this in MATLAB:

clear();

n = 100;
T = zeros(n);
T(1, 1:n) = 200;
T(n, 1:n) = 500;
T(1:n, 1) = 500;
T(1:n, n) = 500;

numItr = 1000;

hF = @() SolveLaplace(T, numItr);

TimeItMin(hF)

Will yield: ans = 0.0422.

In Julia @btime SolveLaplace($T, $numItr); will yield 41.179 ms (0 allocations: 0 bytes).

So both probably yield equivalent machine code for these simple loops.
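One way to sanity-check that on the Julia side (a suggestion only; I have not compared it against MATLAB's generated code) is to inspect the native code emitted for the hot loop:

@code_native debuginfo=:none SolveLaplace(T, 1)   # T and SolveLaplace from the benchmark above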

Remark
The function TimeItMin() is the same as MATLAB’s timeit() with a single modification: it uses the minimum run time instead of the median, to match Julia’s @btime.

9 Likes

The ! at the end of the function name is a Julia convention for functions that modify (“mutate”) an argument in place. I used it in my code because I was passing T as an argument and modifying it inside the function. If you rewrite the function to take n as an argument and allocate T internally, then you shouldn’t use !. Again, this is purely a convention: you can name your functions whatever you want, but if you’re going to use Julia seriously you’ll want to understand what ! means.
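A standard-library illustration of the convention:

v = [3, 1, 2]
w = sort(v)   # no `!`: returns a new sorted array, v is untouched
sort!(v)      # `!`: sorts v in place; now v == [1, 2, 3]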

Conceptually, I think it is cleaner to separate the code that allocates T and initializes it to a particular set of boundary conditions and initial values from the code that solves Laplace’s equation.
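Something along these lines, for instance (a sketch; the init_boundary! helper is hypothetical, not from any code above):

# Hypothetical helper: allocate-and-initialize is kept separate from solving.
function init_boundary!(T, top, bottom, left, right)
    T[1, :]   .= top
    T[end, :] .= bottom
    T[:, 1]   .= left
    T[:, end] .= right
    return T
end

T = init_boundary!(zeros(100, 100), 500, 500, 200, 500)
doit3!(T)   # the solver from earlier in the thread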

For benchmarking, neither your original MATLAB code nor your original Julia code includes the initialization of T in the timing. Of course, you can put a @time call inside the function, but from a coding standpoint it is usually better to separate benchmarking from implementation. (Especially if you want to use a package like BenchmarkTools.jl to get more reliable timing statistics, or to do more fine-grained performance profiling.)

5 Likes