Julia can be dramatically slower than Matlab when solving a 2D PDE

Hello,
It is said that Julia is very fast in comparison with other high-level languages. Since I do scientific computing, I was interested in this language, so I compared the performance of Julia and Matlab when solving the two-dimensional heat conduction equation. The results are quite disappointing. The Matlab code is as follows:

N = 101;
Fo = 1e-2;
h = 1/(N-1);
T=zeros(N,N);
Th = 0.5;
Tc = -0.5;
Time = 1000;
dt=0.1;
tic
for iter = 1:Time
    Tn = T;
    for i = 2:N-1
        for j = 2:N-1
            T(i,j) = Tn(i,j) + Fo * dt * ((Tn(i+1,j) - 2 * Tn(i,j) + Tn(i-1,j))/h + (Tn(i,j+1) - 2 * Tn(i,j) + Tn(i,j-1))/h);
        end
    end
    T(1:N,1) = Th;
    T(1:N,N) = Tc;
    T(1,1:N) = T(2,1:N);
    T(N,1:N) = T(N-1,1:N);
end
toc

The execution time is around 0.2 seconds.
The Julia code is as follows:

N = 101;
Fo = 0.01;
h = 1/(N-1);
T=zeros(N,N);
Th = 0.5;
Tc = -0.5;
Time = 1000;
dt=0.1;
function energy()
    for iter = 1:Time
        Tn = T;
        for i = 2:N-1
            for j = 2:N-1
                T[i,j] = Tn[i,j] + Fo * dt * ((Tn[i+1,j] - 2 * Tn[i,j] + Tn[i-1,j])/h + (Tn[i,j+1] - 2 * Tn[i,j] + Tn[i,j-1])/h);
            end
        end
        T[1:N,1] .= Th;
        T[1:N,N] .= Tc;
        T[1,1:N] .= T[2,1:N];
        T[N,1:N] .= T[N-1,1:N];
    end
end
@time energy()

The execution time is around 8.4 seconds.
I suspect that this huge performance gap is due to a naive implementation in Julia. However, I was interested in a direct code translation from Matlab to Julia. My questions are as follows:

  1. Will I get the same poor performance in Julia when using a naive CUDA implementation, without writing CUDA kernels?
  2. Can I increase the Julia code's performance without additional packages?
  3. I suspect that vectorization might help; however, I doubt the code would become faster than the vectorized Matlab code. Am I wrong?
1 Like

Your code is hard to read with that formatting, but it looks like you are using a lot of global variables. Put everything in functions and pass the variables to the functions as arguments rather than using globals.

function energy(Time, N, T, Th, Fo, dt, Tc, h)
    for iter = 1:Time
        Tn = T
        for i = 2:N-1
            for j = 2:N-1
                T[i, j] = Tn[i, j] + Fo * dt * ((Tn[i+1, j] - 2 * Tn[i, j] + Tn[i-1, j]) / h + (Tn[i, j+1] - 2 * Tn[i, j] + Tn[i, j-1]) / h)
            end
        end
        for j in 1:N # I converted this into a loop to fuse it all together
            T[j, 1] = Th
            T[j, N] = Tc
            T[1, j] = T[2, j]
            T[N, j] = T[N-1, j]
        end
    end
end

using BenchmarkTools # for benchmarking
@benchmark energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
BenchmarkTools.Trial: 84 samples with 1 evaluation.
 Range (min … max):  59.009 ms …  62.235 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.959 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.995 ms ± 469.045 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁ ▁  ▁ ▆  ▁▁▃▃ ▆ ▁▃▁█  ▁
  ▄▄▁▁▁▁▁▄▁▁▄▇█▁█▇▄█▇█▄▇████▄█▇████▄▁█▇▄▄▄▇▁▁▄▁▁▇▁▁▁▁▁▁▁▄▄▁▁▁▄ ▁
  59 ms           Histogram: frequency by time         61.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0

There are some other performance optimisations you could make but this is the most obvious one.

5 Likes

I’m presuming you’ve run this in the REPL or as a standalone file?

All the variables you define outside the energy function aren't constant, so the compiler has to treat them as unknown types and can't compile efficient code. Trying your code, moving the variables into the energy function changes the runtime I see from 5.4 s to 0.07 s.

If you have a bunch of parameters like this, I’d recommend creating a “parameters struct” like this:

struct EnergyParameters
    N::Int
    Fo::Float64
    h::Float64
    Th::Float64
    Tc::Float64
end

Then you can define energy(params::EnergyParameters, time::Int, dt::Float64).
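
For illustration, the pieces could then fit together roughly like this (a sketch only, untested, with T passed as an extra argument since the struct holds just the scalar parameters):

# Sketch: assumes the EnergyParameters struct above is already defined.
function energy(params::EnergyParameters, T::Matrix{Float64}, time::Int, dt::Float64)
    (; N, Fo, h, Th, Tc) = params    # unpack the fields by name (Julia ≥ 1.7)
    for iter in 1:time
        # ... same update and boundary loops as in the reply above ...
    end
    return T
end

params = EnergyParameters(101, 1e-2, 1/100, 0.5, -0.5)
energy(params, zeros(params.N, params.N), 1000, 0.1)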

In other words, you likely want to start slapping const in front of global variables, or adjust your coding style :slight_smile:
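
For example, with the values from the original post, that would look something like:

# A `const` global has a fixed binding, so the compiler can assume its type.
const N  = 101
const Fo = 1e-2
const h  = 1 / (N - 1)
const Th = 0.5
const Tc = -0.5
const Time = 1000
const dt = 0.1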


Aside for other Julia users: I did notice the inner nested loop could be simplified to a broadcast:

@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)

but this actually worsens performance by a factor of ~2 :frowning:

3 Likes

Are you including compilation time in your comparison?

2 Likes

Dear Daniel,
Many thanks for your answer. It is very interesting that your modification, where you converted part of the code into a loop, makes the code as fast as the Matlab code. Moreover, if we increase the number of iterations, Julia outperforms Matlab. Why did the additional loop make it faster :thinking: ? I thought we should avoid loops for performance reasons :thinking:

2 Likes

You want to avoid loops in MATLAB because its fast vectorised operations call code not implemented in MATLAB itself, while MATLAB loops carry interpreter overhead. This is not the case in Julia - your .= calls are implemented as loops in Julia, so what I was doing was simply fusing 4 separate loops into a single loop. (Sometimes, going too far with vectorisation in Julia will even make things slower.)

7 Likes

Dear tecosaur,
Many thanks for your answer. I wrote the code in a Jupyter notebook. It seems that Julia runs faster with loops, because Daniel's modification with the additional loop significantly sped up the code. And your modification proved that, because, as you wrote, the performance worsened when the loops were removed.

1 Like

Thanks a lot!

For more performance tips you can check out this page: Performance Tips · The Julia Language

If it looks a bit too daunting, I made what I hope is a more accessible tutorial last week: https://gdalle.github.io/JuliaOptimizationDays2024-FastJulia/

13 Likes

Dear TimG,
I simply used tic/toc in Matlab and @time in Julia. Probably you are referring to @btime. I tried it too and the results were the same.

Dear gdalle,
Thanks for the link. I will check it

1 Like

It’s slower due to the allocation of array slices; it needs @views to avoid copying data:

@. @views T[2:N-1,2:N-1] = ...
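
Spelled out on the broadcast posted above, that would presumably read:

@. @views T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)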
10 Likes

No, actually, it is eliminating references to global variables that speeds it up. A slight optimization that comes from the additional loop is avoiding allocations in expressions like T[1,1:N] .= T[2,1:N]. The rhs allocates a temporary array in Julia; to get an array view like in Matlab you need to explicitly write T[1,1:N] .= @view T[2,1:N].
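
A rough way to see that temporary (a sketch; exact byte counts vary):

T = zeros(101, 101)
row_with_slice!(T) = @allocated(T[1, :] .= T[2, :])         # rhs slice copies row 2 into a temporary
row_with_view!(T)  = @allocated(T[1, :] .= @view T[2, :])   # rhs view reads row 2 in place
row_with_slice!(T); row_with_view!(T)                       # run once to compile
@show row_with_slice!(T) row_with_view!(T)                  # the slice version allocates, the view version (almost) nothing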

2 Likes

Dear Vasily,
Many thanks for this explanation! Julia is hard :face_with_diagonal_mouth:

There is probably a misconception about what Tn = T does here. It just binds the new label Tn to the same data as T, so this line would be meaningless if we simply replaced all the Tn by T in the subsequent lines.

I assume that this is wrong, and that you probably wanted to copy the data in T into a new array Tn.
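
A tiny example of the difference:

julia> A = [1 2; 3 4]; B = A;       # B is just another label for the same array

julia> B[1, 1] = 99; A[1, 1]        # the change shows up through A as well
99

julia> C = copy(A); C[1, 1] = 0; A[1, 1]   # a copy is independent data
99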

Apart from that, you can also add @views in front of the whole function to get the “slices do not allocate” behavior. With these changes, we get the following code (where the inner loops are not expanded):

julia> @views function energy2(Time, N, T, Th, Fo, dt, Tc, h)
           Tn = similar(T)
           for iter = 1:Time
               Tn .= T
               @. T[2:N-1,2:N-1] += (Fo * dt / h) * ((Tn[3:N,2:N-1]) - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2])
               T[1:N,1] .= Th
               T[1:N,N] .= Tc
               T[1,1:N] .= T[2,1:N]
               T[N,1:N] .= T[N-1,1:N]
           end
           return T
       end
energy2 (generic function with 1 method)

julia> @benchmark energy2($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
BenchmarkTools.Trial: 272 samples with 1 evaluation.
 Range (min … max):  17.873 ms …  19.690 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.359 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.435 ms ± 365.780 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▇ ▅  ▂ ▂▂ ▁█▅▂▆▃▄ ▂  ▂▁▂
  ▄█▃▆███▆▇████▄█████████▅▅███▃▇▇█▅▃▅▃▆▅▁▇▄▅▁▅▆▅▅▇▃▃▄▃▄▃▃▁▁▃▁▃ ▄
  17.9 ms         Histogram: frequency by time         19.4 ms <

 Memory estimate: 79.80 KiB, allocs estimate: 2.

Which is about as fast as the explicit-loop alternative with the same correction (although they are not producing the same output, probably because some operation in that broadcasted expression is wrong - honestly I find the loop easier to read).

A subtlety appeared here in the version with the @. T[2:N-1,2:N-1] += (Fo * dt / h) ... broadcast when the (wrong?) Tn = T was used: since the arrays T and Tn were the same, the broadcasting operation was not able to elide all allocations of intermediates, so we still saw allocations before fixing the issue with the array copy.

2 Likes

Actually, the most important rule in Julia is to avoid global variables. That is, if you take your original code without any modification whatsoever and put everything inside a function, you get an enormous boost in performance:

julia> function main()
           N = 101;
           Fo = 0.01;
           h = 1/(N-1);
           T = zeros(N,N);
           Th = 0.5;
           Tc = -0.5;
           Time = 1000;
           dt = 0.1;
           function energy()
               for iter = 1:Time
                   Tn = T;
                   for i = 2:N-1
                       for j = 2:N-1
                           T[i,j] = Tn[i,j] + Fo * dt * ((Tn[i+1,j] - 2 * Tn[i,j] + Tn[i-1,j])/h + (Tn[i,j+1] - 2 * Tn[i,j] + Tn[i,j-1])/h);
                       end
                   end
                   T[1:N,1] .= Th;
                   T[1:N,N] .= Tc;
                   T[1,1:N] .= T[2,1:N];
                   T[N,1:N] .= T[N-1,1:N];
               end
           end
           @time energy()
       end
main (generic function with 1 method)

julia> main()
  0.054220 seconds (2.00 k allocations: 1.709 MiB)

versus ~4 s for running the same code directly, without the function, by copying and pasting it into the REPL.

Then if you add @views to the whole main() function and fix the Tn = T issue, you get the same performance as the best of the above:

julia> @views function main()
           N = 101;
           Fo = 0.01;
           h = 1/(N-1);
           T = zeros(N,N);
           Th = 0.5;
           Tc = -0.5;
           Time = 1000;
           dt = 0.1;
           function energy()
               Tn = copy(T)
               for iter = 1:Time
                   Tn .= T;
                   for i = 2:N-1
                       for j = 2:N-1
                           T[i,j] = Tn[i,j] + Fo * dt * ((Tn[i+1,j] - 2 * Tn[i,j] + Tn[i-1,j])/h + (Tn[i,j+1] - 2 * Tn[i,j] + Tn[i,j-1])/h);
                       end
                   end
                   T[1:N,1] .= Th;
                   T[1:N,N] .= Tc;
                   T[1,1:N] .= T[2,1:N];
                   T[N,1:N] .= T[N-1,1:N];
               end
           end
           energy()
       end
main (generic function with 1 method)

julia> @btime main()
  17.368 ms (4 allocations: 159.59 KiB)

So there are a few things we must learn to use Julia properly, but it is not an insurmountable barrier. Your original code was almost there.

15 Likes

Dear Imiq,
Many thanks for your response. T is the current temperature, whereas Tn is the temperature at the previous time layer (previous iteration). It is an explicit Euler scheme, which is fully correct in Matlab. And it seems correct in Julia too, since it reproduces results similar to Matlab's (I only plotted contours, without a quantitative evaluation). I get that I should change my style of writing code in Julia, as tecosaur suggested.

As a side note, you can probably get a dramatically faster simulation by using a better integrator. OrdinaryDiffEq.jl makes it very easy to swap the integrator for something like Tsit5, which should give you a much better accuracy/speed tradeoff.
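
For example, a rough sketch of what that could look like for this problem (untested; the boundary handling is simplified here, with the hot and cold walls baked into the initial state and simply held fixed):

using OrdinaryDiffEq

# Same diffusion stencil as above, written as dT/dt for an ODE solver.
# Boundary rows/columns are left untouched (dT = 0 there), so the walls from the
# initial condition stay fixed; the insulated top/bottom of the original scheme
# is not reproduced exactly in this sketch.
function heat!(dT, T, p, t)
    Fo, h, N = p
    fill!(dT, 0.0)
    for j in 2:N-1, i in 2:N-1
        dT[i, j] = Fo * ((T[i+1, j] - 2T[i, j] + T[i-1, j]) / h +
                         (T[i, j+1] - 2T[i, j] + T[i, j-1]) / h)
    end
    return nothing
end

N  = 101
T0 = zeros(N, N)
T0[:, 1] .= 0.5      # hot wall
T0[:, N] .= -0.5     # cold wall
prob = ODEProblem(heat!, T0, (0.0, 100.0), (1e-2, 1 / (N - 1), N))
sol  = solve(prob, Tsit5())   # adaptive 5th-order explicit Runge-Kutta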

4 Likes

Just check it carefully, because in Julia Tn = T does not copy the array (it just assigns a new label to it), while in Matlab (it seems) it does copy the array. Here it does not seem to make a difference, but the semantics are very different and can cause many issues in other contexts.

Note that in the codes posted the results differ if we change Tn = T to:

           for iter = 1:Time
               Tn = copy(T) # instead of Tn = T

and I feel that the copy is the correct choice there, because otherwise you are modifying the current temperature during each iteration of the subsequent loop.

(and if the copy is necessary there, it is better to create the Tn array outside the loop and just update it in place with Tn .= T there).

2 Likes

Got it! Many thanks for this important note. I will do a quantitative evaluation in this case.